Numpy Reminders:

Let's have a look at some of the most basic numpy concepts that we will use in the more advanced sections

In [1]:
import numpy as np
import datetime as datetime
from matplotlib import dates
import pandas as pd

Numpy Creating data

In order to test our functions, being able to create "random" data is the key

In [3]:
#Create a range based on your input. From 0 to 10, not including 10 and step size parameter 2
np.arange(0,10,2)
Out[3]:
array([0, 2, 4, 6, 8])
In [4]:
#Return evenly space 5 floats from 0 to 10, including 10
np.linspace(0,10,5)
Out[4]:
array([ 0. ,  2.5,  5. ,  7.5, 10. ])
In [5]:
#return 2 radom floats from a uniform distribution (all numbers have the same probability)
np.random.rand(2)
Out[5]:
array([0.16419349, 0.2470291 ])
In [6]:
#return 2 radom floats from a normal distribution with mean = 0 and std = 1
np.random.randn(2)
Out[6]:
array([-0.39035138, -0.41370209])
In [7]:
#return 2 radom floats from a normal distribution with mean = 3 and std = 1
np.random.normal(3,1,2)
Out[7]:
array([5.28203783, 3.48991945])
In [8]:
#Generate the same random numbers by setting a seed
#The number in the seed is irrelevant
#It will only work if we are on the same cell
np.random.seed(1)
print(np.random.rand(1))

np.random.seed(2)
print(np.random.rand(1))

np.random.seed(1)
print(np.random.rand(1))
[0.417022]
[0.4359949]
[0.417022]
In [9]:
#Creating a matrix from an array
#First create an array
arr = np.arange(4)

#Then reshape it
matrix = arr.reshape(2,2)
In [10]:
#return the min and max values of an array/matrix

print(arr.max())
print(matrix.min())
3
0

Numpy Indexing and Selection

In [11]:
#Remember if you want to work with array copies use:
arr_copy = arr.copy()

#Otherwise you will be always editing the original array as well,
#since the new object created is pointing to the original object
In [12]:
#Create a matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(matrix)
#Return the item in the column 1 and row 1
matrix[1][1]
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Out[12]:
5
In [13]:
#Return (from the matrix) until the 2 row (not included)
matrix[:2]
Out[13]:
array([[1, 2, 3],
       [4, 5, 6]])
In [14]:
#Until row 2 and from column 1
matrix[:2,1:]
Out[14]:
array([[2, 3],
       [5, 6]])
In [15]:
#Return a filtered array whose values are lower than 2
print("Original array: ")
print(arr)
print("Result: ") 
print(arr[arr<2])
Original array: 
[0 1 2 3]
Result: 
[0 1]

Numpy Operations

Skipping the basic ones

In [16]:
#Sum of all the values of the columns
print(matrix)
matrix.sum(axis=0)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Out[16]:
array([12, 15, 18])
In [17]:
#Sum of all the values of the rows
print(matrix)
matrix.sum(axis=1)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Out[17]:
array([ 6, 15, 24])

Pandas Reminders

In [18]:
import pandas as pd
In [19]:
#Create a Matrix, which will be used for the dataframe creation
rand_mat = np.random.rand(5,4)
In [20]:
#Create dataframe
df = pd.DataFrame(data=rand_mat, index = 'A B C D E'.split(), columns = "R P U I".split())
In [21]:
#Drop row
df.drop("A")
Out[21]:
R P U I
B 0.092339 0.186260 0.345561 0.396767
C 0.538817 0.419195 0.685220 0.204452
D 0.878117 0.027388 0.670468 0.417305
E 0.558690 0.140387 0.198101 0.800745
In [22]:
#Drop column
df.drop("R",axis=1)
Out[22]:
P U I
A 0.000114 0.302333 0.146756
B 0.186260 0.345561 0.396767
C 0.419195 0.685220 0.204452
D 0.027388 0.670468 0.417305
E 0.140387 0.198101 0.800745
In [23]:
#Return series of the row A
df.loc["A"]
Out[23]:
R    0.720324
P    0.000114
U    0.302333
I    0.146756
Name: A, dtype: float64
In [24]:
#Return series of the row number 2
df.iloc[2]
Out[24]:
R    0.538817
P    0.419195
U    0.685220
I    0.204452
Name: C, dtype: float64
In [25]:
#Filtering by value. Filter all rows that are smaller than 0.3 in column I
df[df["I"]>0.3]
Out[25]:
R P U I
B 0.092339 0.186260 0.345561 0.396767
D 0.878117 0.027388 0.670468 0.417305
E 0.558690 0.140387 0.198101 0.800745
In [26]:
#Return a unique array of the column R
df["R"].unique()
Out[26]:
array([0.72032449, 0.09233859, 0.53881673, 0.87811744, 0.55868983])
In [27]:
#Return the number of unique items of the array of the column R
df["R"].nunique()
Out[27]:
5
In [28]:
#Apply a function to a column

df["R"].apply(lambda a: a+1)
Out[28]:
A    1.720324
B    1.092339
C    1.538817
D    1.878117
E    1.558690
Name: R, dtype: float64

Pandas Viz Reminders

In [29]:
#Display plots directly in jupyter
%matplotlib inline
In [30]:
#Import data
df1 = pd.read_csv("./original/TSA_COURSE_NOTEBOOKS/03-Pandas-Visualization/df1.csv",index_col=0)
df2 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/03-Pandas-Visualization/df2.csv')

Plot Types

There are several plot types built into pandas; most of them are statistical by nature:

df.plot.hist()     histogram
df.plot.bar()      bar chart
df.plot.barh()     horizontal bar chart
df.plot.line()     line chart
df.plot.area()     area chart
df.plot.scatter()  scatter plot
df.plot.box()      box plot
df.plot.kde()      kde plot
df.plot.hexbin()   hexagonal bin plot
df.plot.pie()      pie chart

You can also call specific plots by passing their name as an argument, as with df.plot(kind='area').

Let's start going through them! First we'll look at the data:

Histograms

This is one of the most commonly used plots. Histograms describe the distribution of continuous data by dividing the data into "bins" of equal width, and plotting the number of values that fall into each bin. [reference]

In [31]:
df1['A'].plot.hist();

We can add settings to do things like bring the x- and y-axis values to the edge of the graph, and insert lines between vertical bins:

In [32]:
df1['A'].plot.hist(edgecolor='k').autoscale(enable=True, axis='both', tight=True)

You can use any matplotlib color spec for edgecolor, such as 'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w', or the string representation of a float value for shades of grey, such as '0.5'

For autoscale the axis can be set to 'x', 'y' or 'both'

We can also change the number of bins (the range of values over which frequencies are calculated) from the default value of 10:

In [33]:
df1['A'].plot.hist(bins=40, edgecolor='k').autoscale(enable=True, axis='both', tight=True)

You can also access an histogram like this:

In [34]:
df1['A'].hist();

Barplots

Barplots are similar to histograms, except that they deal with discrete data, and often reflect multiple variables. [reference]

In [35]:
df2.plot.bar();
In [36]:
df2.plot.bar(stacked=True);
In [37]:
# USE .barh() TO DISPLAY A HORIZONTAL BAR PLOT
df2.plot.barh();

Line Plots

Line plots are used to compare two or more variables. By default the x-axis values are taken from the index. [reference]

Line plots happen to be the default pandas plot. They are accessible through df.plot() as well as df.plot.line()

In [38]:
df2.plot.line(y='a',figsize=(12,3),lw=2);
In [39]:
# Use lw to change the size of the line

df2.plot.line(y=['a','b','c'],figsize=(12,3),lw=3);

Area Plots

Area plots represent cumulatively stacked line plots where the space between lines is emphasized with colors. [reference]

In [40]:
df2.plot.area();

It often helps to mute the colors by passing an alpha transparency value between 0 and 1.

In [41]:
df2.plot.area(alpha=0.4);

To produce a blended area plot, pass a stacked=False argument:

In [42]:
df2.plot.area(stacked=False, alpha=0.4);

Scatter Plots

Scatter plots are a useful tool to quickly compare two variables, and to look for possible trends. [reference]

In [43]:
df1.plot.scatter(x='A',y='B');

Scatter plots with colormaps

You can use c to color each marker based off another column value. Use cmap to indicate which colormap to use.
For all the available colormaps, check out: http://matplotlib.org/users/colormaps.html

In [44]:
df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm');

Scatter plots with sized markers

Alternatively you can use s to indicate marker size based off another column. The s parameter needs to be an array, not just the name of a column:

In [45]:
df1.plot.scatter(x='A',y='B',s=df1['C']*50);
/home/rangelrey/.local/share/virtualenvs/timeseriesanalysis-xa2U-udI/lib/python3.6/site-packages/matplotlib/collections.py:886: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor

The warning above appeared because some df1['C'] values are negative. We can fix this finding the minimum value, writing a function that adds to each value, and applying our function to the data with .apply(func).

Also, these data points have a lot of overlap. We can address this issue by passing in an alpha blending value between 0 and 1 to make markers more transparent.

In [46]:
def add_three(val):
    return val+3

df1.plot.scatter(x='A',y='B',s=df1['C'].apply(add_three)*45, alpha=0.2);

BoxPlots

Box plots, aka "box and whisker diagrams", describe the distribution of data by dividing data into quartiles about the mean.
Look here for a description of boxplots. [reference]

In [47]:
df2.boxplot();

Boxplots with Groupby

To draw boxplots based on groups, first pass in a list of columns you want plotted (including the groupby column), then pass by='columname' into .boxplot(). Here we'll group records by the 'e' column, and draw boxplots for the 'b' column.

In [48]:
df2[['b','e']].boxplot(by='e', grid=False);

In the next section on Customizing Plots we'll show how to change the title and axis labels.

Kernel Density Estimation (KDE) Plot

In order to see the underlying distribution, which is similar to an histogram. These plots are accessible either through df.plot.kde() or df.plot.density() [reference]

In [49]:
df2['a'].plot.kde();
In [50]:
df2.plot.density();

Hexagonal Bin Plot

Useful for Bivariate Data, alternative to scatterplot. [reference]

In [51]:
# FIRST CREATE A DATAFRAME OF RANDOM VALUES
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])

# MAKE A HEXAGONAL BIN PLOT
df.plot.hexbin(x='a',y='b',gridsize=25,cmap='Oranges');

Customizing Pandas Plots

In this section we'll show how to control the position and appearance of axis labels and legends.
For more info on the following topics visit https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html

Colors, Widths and Linestyles

The pandas .plot() method takes optional arguments that allow you to control linestyles, colors, widths and more.

In [52]:
# START WITH A SIMPLE LINE PLOT
df2['c'].plot(figsize=(8,3));
In [53]:
df2['c'].plot.line(ls='-.', c='r', lw='4', figsize=(8,3));

For more on linestyles, click here.

Adding Titles and Axis Labels

In [54]:
# START WITH A SIMPLE MULTILINE PLOT
df2.plot(figsize=(8,3));

Object-oriented plotting

When we call df.plot(), pandas returns a matplotlib.axes.AxesSubplot object. We can set labels on that object so long as we do it in the same jupyter cell. Setting an autoscale is done the same way.

In [55]:
title='Custom Pandas Plot'
ylabel='Y-axis data'
xlabel='X-axis data'

ax = df2.plot(figsize=(8,3),title=title)
ax.set(xlabel=xlabel, ylabel=ylabel)
ax.autoscale(axis='x',tight=True);

Plot Legend Placement

Pandas tries to optimize placement of the legend to reduce overlap on the plot itself. However, we can control the placement ourselves, and even place the legend outside of the plot. We do this through the .legend() method. [reference]

For starters we can pass a location code. .legend(loc=1) places the legend in the upper-right corner of the plot.
Alternatively we can pass a location string: .legend(loc='upper right') does the same thing.

LOCATION CODELOCATION STRING
0'best'
1'upper right'
2'upper left'
3'lower left'
4'lower right'
5'right'
6'center left'
7'center right'
8'lower center'
9'upper center'
10'center'
In [56]:
ax = df2.plot(figsize=(8,3))
ax.autoscale(axis='x',tight=True)
ax.legend(loc=1);

We can pass a second argument, bbox_to_anchor that treats the value passed in through loc as an anchor point, and positions the legend along the x and y axes based on a two-value tuple.

In [57]:
# FIRST, PLACE THE LEGEND IN THE LOWER-LEFT
ax = df2.plot(figsize=(8,3))
ax.autoscale(axis='x',tight=True)
ax.legend(loc=3);
In [58]:
# NEXT, MOVE THE LEGEND A LITTLE TO THE RIGHT AND UP
ax = df2.plot(figsize=(8,3))
ax.autoscale(axis='x',tight=True)
ax.legend(loc=3, bbox_to_anchor=(0.1,0.1));

Placing the Legend Outside the Plot

In the above plot we passed (0.1,0.1) as our two-item tuple. This places the legend slightly to the right and slightly upward.
To place the legend outside the plot on the right-hand side, pass a value greater than or equal to 1 as the first item in the tuple.

In [59]:
ax = df2.plot(figsize=(8,3))
ax.autoscale(axis='x',tight=True)
ax.legend(loc=3, bbox_to_anchor=(1.0,0.1));

Pandas Datetime Index

We'll usually deal with time series as a datetime index when working with pandas dataframes. Fortunately pandas has a lot of functions and methods to work with time series!
For more on the pandas DatetimeIndex visit https://pandas.pydata.org/pandas-docs/stable/timeseries.html

Ways to build a DatetimeIndex:

In [60]:
# THE WEEK OF JULY 8TH, 2018
idx = pd.date_range('7/8/2018', periods=7, freq='D')
idx
Out[60]:
DatetimeIndex(['2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11',
               '2018-07-12', '2018-07-13', '2018-07-14'],
              dtype='datetime64[ns]', freq='D')
In [61]:
idx = pd.to_datetime(['Jan 01, 2018','1/2/18','03-Jan-2018',None])
idx
Out[61]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', 'NaT'], dtype='datetime64[ns]', freq=None)
In [62]:
# Create a NumPy datetime array
some_dates = np.array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[D]')
some_dates
Out[62]:
array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[D]')
In [63]:
pd.to_datetime(['2/1/2018','3/1/2018'],format='%d/%m/%Y')
Out[63]:
DatetimeIndex(['2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)

Time Resampling

Note: the above code is a faster way of doing the following:

df = pd.read_csv('../Data/starbucks.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date',inplace=True)

resample()

A common operation with time series data is resampling based on the time series index. Let's see how to use the resample() method. [reference]

In [64]:
# Index_col indicates that the index will be the column called 'Date'
# parse_dates, transforms the strings into datetime format

df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/starbucks.csv', index_col='Date', parse_dates=True)
In [65]:
# Our index
df.index
Out[65]:
DatetimeIndex(['2015-01-02', '2015-01-05', '2015-01-06', '2015-01-07',
               '2015-01-08', '2015-01-09', '2015-01-12', '2015-01-13',
               '2015-01-14', '2015-01-15',
               ...
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27',
               '2018-12-28', '2018-12-31'],
              dtype='datetime64[ns]', name='Date', length=1006, freq=None)

When calling .resample() you first need to pass in a rule parameter, then you need to call some sort of aggregation function.

The rule parameter describes the frequency with which to apply the aggregation function (daily, monthly, yearly, etc.)
It is passed in using an "offset alias" - refer to the table below. [reference]

The aggregation function is needed because, due to resampling, we need some sort of mathematical rule to join the rows (mean, sum, count, etc.)

TIME SERIES OFFSET ALIASES
ALIASDESCRIPTION
Bbusiness day frequency
Ccustom business day frequency (experimental)
Dcalendar day frequency
Wweekly frequency
Mmonth end frequency
SMsemi-month end frequency (15th and end of month)
BMbusiness month end frequency
CBMcustom business month end frequency
MSmonth start frequency
SMSsemi-month start frequency (1st and 15th)
BMSbusiness month start frequency
CBMScustom business month start frequency
Qquarter end frequency
intentionally left blank
ALIASDESCRIPTION
BQbusiness quarter endfrequency
QSquarter start frequency
BQSbusiness quarter start frequency
Ayear end frequency
BAbusiness year end frequency
ASyear start frequency
BASbusiness year start frequency
BHbusiness hour frequency
Hhourly frequency
T, minminutely frequency
Ssecondly frequency
L, msmilliseconds
U, usmicroseconds
Nnanoseconds

Let's resample our dataframe, by using rule "A", which is year and frecuency and aggregate it with the mean

In [66]:
# Yearly Means
df.resample(rule='A').mean()
Out[66]:
Close Volume
Date
2015-12-31 50.078100 8.649190e+06
2016-12-31 53.891732 9.300633e+06
2017-12-31 55.457310 9.296078e+06
2018-12-31 56.870005 1.122883e+07

Resampling rule 'A' takes all of the data points in a given year, applies the aggregation function (in this case we calculate the mean), and reports the result as the last day of that year.

In [67]:
title = 'Monthly Max Closing Price for Starbucks'
df['Close'].resample('M').max().plot.bar(figsize=(16,6), title=title,color='#1f77b4');

Time Shifting

Sometimes you may need to shift all your data up or down along the time series index. In fact, a lot of pandas built-in methods do this under the hood. This isn't something we'll do often in the course, but it's definitely good to know about this anyways!

In [68]:
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/starbucks.csv',index_col='Date',parse_dates=True)

.shift() forward

This method shifts the entire date index a given number of rows, without regard for time periods (months & years).
It returns a modified copy of the original DataFrame.

In other words, it moves down all the rows down or up.

In [69]:
# We move down all the rows
df.shift(1).head()
Out[69]:
Close Volume
Date
2015-01-02 NaN NaN
2015-01-05 38.0061 6906098.0
2015-01-06 37.2781 11623796.0
2015-01-07 36.9748 7664340.0
2015-01-08 37.8848 9732554.0
In [70]:
# NOTE: You will lose that last piece of data that no longer has an index!
df.shift(1).tail()
Out[70]:
Close Volume
Date
2018-12-24 61.39 23524888.0
2018-12-26 60.56 6323252.0
2018-12-27 63.08 16646238.0
2018-12-28 63.20 11308081.0
2018-12-31 63.39 7712127.0

Shifting based on Time Series Frequency Code

We can choose to shift index values up or down without realigning the data by passing in a freq argument.
This method shifts dates to the next period based on a frequency code. Common codes are 'M' for month-end and 'A' for year-end.
Refer to the Time Series Offset Aliases table from the Time Resampling lecture for a full list of values, or click here.

In [71]:
# Shift everything to the end of the month
df.shift(periods=1, freq='M').head()
Out[71]:
Close Volume
Date
2015-01-31 38.0061 6906098
2015-01-31 37.2781 11623796
2015-01-31 36.9748 7664340
2015-01-31 37.8848 9732554
2015-01-31 38.4961 13170548

Rolling and Expanding

A common process with time series is to create data based off of a rolling mean. The idea is to divide the data into "windows" of time, and then calculate an aggregate function for each window. In this way we obtain a simple moving average. Let's show how to do this easily with pandas!

In [72]:
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/starbucks.csv', index_col='Date', parse_dates=True)
In [73]:
df['Close'].plot(figsize=(12,5)).autoscale(axis='x',tight=True);

Now let's add in a rolling mean! This rolling method provides row entries, where every entry is then representative of the window.

In [74]:
# 7 day rolling mean
df.rolling(window=7).mean().head(15)
Out[74]:
Close Volume
Date
2015-01-02 NaN NaN
2015-01-05 NaN NaN
2015-01-06 NaN NaN
2015-01-07 NaN NaN
2015-01-08 NaN NaN
2015-01-09 NaN NaN
2015-01-12 37.616786 1.238222e+07
2015-01-13 37.578786 1.297288e+07
2015-01-14 37.614786 1.264020e+07
2015-01-15 37.638114 1.270624e+07
2015-01-16 37.600114 1.260380e+07
2015-01-20 37.515786 1.225634e+07
2015-01-21 37.615786 9.868837e+06
2015-01-22 37.783114 1.185335e+07
2015-01-23 38.273129 1.571999e+07
In [75]:
df['Close'].plot(figsize=(12,5)).autoscale(axis='x',tight=True)
df.rolling(window=30).mean()['Close'].plot();

Expanding

Instead of calculating values for a rolling window of dates, what if you wanted to take into account everything from the start of the time series up to each point in time? For example, instead of considering the average over the last 7 days, we would consider all prior data in our expanding set of averages.

In [76]:
# Optional: specify a minimum number of periods to start from
df['Close'].expanding(min_periods=30).mean().plot(figsize=(12,5));

Visualizing Time Series Data

In [77]:
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/starbucks.csv',index_col='Date',parse_dates=True)

First we'll create a line plot that puts both 'Close' and 'Volume' on the same graph.
Remember that we can use df.plot() in place of df.plot.line()

In [78]:
df.index = pd.to_datetime(df.index)
In [79]:
df.plot();

Adding a title and axis labels

In [80]:
title='Starbucks Closing Stock Prices'
ylabel='Closing Price (USD)'
xlabel='Closing Date'

ax = df['Close'].plot(figsize=(12,6),title=title)
ax.autoscale(axis='x',tight=True) 
ax.set(xlabel=xlabel, ylabel=ylabel);

Thanks to the date index, we can make a selection like the following:

In [81]:
df['Close']['2017-01-01':'2017-03-01']
Out[81]:
Date
2017-01-03    53.1100
2017-01-04    53.7241
2017-01-05    54.1750
2017-01-06    54.8179
2017-01-09    55.8446
2017-01-10    55.5376
2017-01-11    55.7487
2017-01-12    55.6815
2017-01-13    55.5088
2017-01-17    55.6527
2017-01-18    56.0845
2017-01-19    55.5472
2017-01-20    55.3265
2017-01-23    55.4224
2017-01-24    56.0749
2017-01-25    56.3244
2017-01-26    56.0941
2017-01-27    53.8488
2017-01-30    53.6377
2017-01-31    52.9852
2017-02-01    51.7186
2017-02-02    51.6899
2017-02-03    52.8317
2017-02-06    53.4746
2017-02-07    53.2433
2017-02-08    53.2240
2017-02-09    53.7927
2017-02-10    54.1878
2017-02-13    54.0818
2017-02-14    54.5348
2017-02-15    54.8047
2017-02-16    54.6794
2017-02-17    55.2770
2017-02-21    55.4601
2017-02-22    55.4890
2017-02-23    55.5565
2017-02-24    55.4023
2017-02-27    54.7276
2017-02-28    54.8143
2017-03-01    55.0746
Name: Close, dtype: float64

X Limits

There are two ways we can set a specific span of time as an x-axis limit. We can plot a slice of the dataset, or we can pass x-limit values as an argument into df.plot().

The advantage of using a slice is that pandas automatically adjusts the y-limits accordingly.

The advantage of passing in arguments is that pandas automatically tightens the x-axis. Plus, if we're also setting y-limits this can improve readability.

Choosing X Limits by Slice:

In [82]:
# Dates are separated by a colon:
df['Close']['2017-01-01':'2017-03-01'].plot(figsize=(12,4)).autoscale(axis='x',tight=True);

Choosing X Limits by Argument:

In [83]:
# Dates are separated by a comma:
#Let's say we want to display the plot only from the 1st of january until the 1t of march
df['Close'].plot(figsize=(12,4),xlim=['2017-01-01','2017-03-01']);

Now let's focus on the y-axis limits to get a better sense of the shape of the data.
First we'll find out what upper and lower limits to use.

In [84]:
# FIND THE MINIMUM VALUE IN THE RANGE:
df.loc['2017-01-01':'2017-03-01']['Close'].min()
Out[84]:
51.6899

Title and axis labels

Let's add a title and axis labels to our subplot.

REMEMBER: ax.autoscale(axis='both',tight=True) is unnecessary if axis limits have been passed into .plot().
If we were to add it, autoscale would revert the axis limits to the full dataset.
In [85]:
title='Starbucks Closing Stock Prices'
ylabel='Closing Price (USD)'
xlabel='Closing Date'

ax = df['Close'].plot(xlim=['2017-01-04','2017-03-01'],ylim=[51,57],figsize=(12,4),title=title)
ax.set(xlabel=xlabel, ylabel=ylabel);

We can pass arguments into .plot() to change the linestyle and color. Refer to the Customizing Plots lecture from the previous section for more options.

In [86]:
df['Close'].plot(xlim=['2017-01-01','2017-03-01'],ylim=[51,57],ls='--',c='r');

X Ticks

In this section we'll look at how to change the format and appearance of dates along the x-axis. To do this, we'll borrow a tool from matplotlib called dates.

Set the spacing

The x-axis values can be divided into major and minor axes. For now, we'll work only with the major axis and learn how to set the spacing with .set_major_locator(). As you can see in the graph below, the X axis is not beautifully distributed

In [87]:
# CREATE OUR AXIS OBJECT
ax = df['Close'].plot(xlim=['2017-01-01','2017-03-01'],ylim=[51,57])

With set_major_locator we can solve this problem

In [88]:
# CREATE OUR AXIS OBJECT
ax = df['Close'].plot(xlim=['2017-01-01','2017-03-01'],ylim=[51,57])

# REMOVE PANDAS DEFAULT "Date" LABEL
ax.set(xlabel='')

# SET THE TICK LOCATOR AND FORMATTER FOR THE MAJOR AXIS
# byweekday = 0 means Monday

ax.xaxis.set_major_locator(dates.WeekdayLocator(byweekday=0))

Notice that dates are spaced one week apart. The dates themselves correspond with byweekday=0, or Mondays.
For a full list of locator options available from matplotlib.dates visit https://matplotlib.org/api/dates_api.html#date-tickers

Date Formatting

Formatting follows the Python datetime strftime codes.
The following examples are based on datetime.datetime(2001, 2, 3, 16, 5, 6):

CODEMEANINGEXAMPLE
%YYear with century as a decimal number.2001
%yYear without century as a zero-padded decimal number.01
%mMonth as a zero-padded decimal number.02
%BMonth as locale’s full name.February
%bMonth as locale’s abbreviated name.Feb
%dDay of the month as a zero-padded decimal number.03
%AWeekday as locale’s full name.Saturday
%aWeekday as locale’s abbreviated name.Sat
%HHour (24-hour clock) as a zero-padded decimal number.16
%IHour (12-hour clock) as a zero-padded decimal number.04
%pLocale’s equivalent of either AM or PM.PM
%MMinute as a zero-padded decimal number.05
%SSecond as a zero-padded decimal number.06
CODEMEANINGEXAMPLE
%#mMonth as a decimal number. (Windows)2
%-mMonth as a decimal number. (Mac/Linux)2
%#xLong dateSaturday, February 03, 2001
%#cLong date and timeSaturday, February 03, 2001 16:05:06
In [89]:
# USE THIS SPACE TO EXPERIMENT WITH DIFFERENT FORMATS
from datetime import datetime
datetime(2001, 2, 3, 16, 5, 6).strftime("%A, %B %d, %Y  %I:%M:%S %p")
Out[89]:
'Saturday, February 03, 2001  04:05:06 PM'

We use the set_major_formatter to format the display of the date in the plot

In [90]:
ax = df['Close'].plot(xlim=['2017-01-01','2017-03-01'],ylim=[51,57],title='2017 Starbucks Closing Stock Prices')
ax.set(xlabel='')

ax.xaxis.set_major_locator(dates.WeekdayLocator(byweekday=0))
ax.xaxis.set_major_formatter(dates.DateFormatter("%a-%B-%d"))

Major vs. Minor Axis Values

All of the tick marks we've used so far have belonged to the major axis. We can assign another level called the minor axis, perhaps to separate month names from days of the month.

In [91]:
ax = df['Close'].plot(xlim=['2017-01-01','2017-03-01'],ylim=[51,57],rot=0,title='2017 Starbucks Closing Stock Prices')
ax.set(xlabel='')

ax.xaxis.set_major_locator(dates.WeekdayLocator(byweekday=0))
ax.xaxis.set_major_formatter(dates.DateFormatter('%d'))

ax.xaxis.set_minor_locator(dates.MonthLocator())
ax.xaxis.set_minor_formatter(dates.DateFormatter('\n\n%b'))

Adding Gridlines

We can add x and y axis gridlines that extend into the plot from each major tick mark.

In [92]:
ax = df['Close'].plot(xlim=['2017-01-01','2017-03-01'],ylim=[51,57],rot=0,title='2017 Starbucks Closing Stock Prices')
ax.set(xlabel='')

ax.xaxis.set_major_locator(dates.WeekdayLocator(byweekday=0))
ax.xaxis.set_major_formatter(dates.DateFormatter('%d'))

ax.xaxis.set_minor_locator(dates.MonthLocator())
ax.xaxis.set_minor_formatter(dates.DateFormatter('\n\n%b'))

ax.yaxis.grid(True)
ax.xaxis.grid(True)

TIME SERIES

Statsmodels

For Further Reading:

Statsmodels Tutorial:  Time Series Analysis

Let's walk through a very simple example of using statsmodels!

Perform standard imports and load the dataset

For these exercises we'll be using a statsmodels built-in macroeconomics dataset:

US Macroeconomic Data for 1959Q1 - 2009Q3
Number of Observations - 203
Number of Variables - 14
Variable name definitions:
    year      - 1959q1 - 2009q3
    quarter   - 1-4
    realgdp   - Real gross domestic product (Bil. of chained 2005 US$,
                seasonally adjusted annual rate)
    realcons  - Real personal consumption expenditures (Bil. of chained
                2005 US$, seasonally adjusted annual rate)
    realinv   - Real gross private domestic investment (Bil. of chained
                2005 US$, seasonally adjusted annual rate)
    realgovt  - Real federal consumption expenditures & gross investment
                (Bil. of chained 2005 US$, seasonally adjusted annual rate)
    realdpi   - Real private disposable income (Bil. of chained 2005
                US$, seasonally adjusted annual rate)
    cpi       - End of the quarter consumer price index for all urban
                consumers: all items (1982-84 = 100, seasonally adjusted).
    m1        - End of the quarter M1 nominal money stock (Seasonally
                adjusted)
    tbilrate  - Quarterly monthly average of the monthly 3-month
                treasury bill: secondary market rate
    unemp     - Seasonally adjusted unemployment rate (%)
    pop       - End of the quarter total population: all ages incl. armed
                forces over seas
    infl      - Inflation rate (ln(cpi_{t}/cpi_{t-1}) * 400)
    realint   - Real interest rate (tbilrate - infl)
NOTE: Although we've provided a .csv file in the Data folder, you can also build this DataFrame with the following code:
    import pandas as pd
    import statsmodels.api as sm
    df = sm.datasets.macrodata.load_pandas().data
    df.index = pd.Index(sm.tsa.datetools.dates_from_range('1959Q1', '2009Q3'))
    print(sm.datasets.macrodata.NOTE)
In [93]:
from statsmodels.tsa.filters.hp_filter import hpfilter
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/macrodata.csv',index_col=0,parse_dates=True)
df.head()
Out[93]:
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
1959-03-31 1959 1 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1959-06-30 1959 2 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
1959-09-30 1959 3 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
1959-12-31 1959 4 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
1960-03-31 1960 1 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
In [94]:
ax = df['realgdp'].plot()
ax.autoscale(axis='x',tight=True)
ax.set(ylabel='REAL GDP');

Using Statsmodels to get the trend

Related Function:

statsmodels.tsa.filters.hp_filter.hpfilter(X, lamb=1600)  Hodrick-Prescott filter

The Hodrick-Prescott filter separates a time-series $y_t$ into a trend component $\tau_t$ and a cyclical component $c_t$

$y_t = \tau_t + c_t$

The components are determined by minimizing the following quadratic loss function, where $\lambda$ is a smoothing parameter:

$\min_{\\{ \tau_{t}\\} }\sum_{t=1}^{T}c_{t}^{2}+\lambda\sum_{t=1}^{T}\left[\left(\tau_{t}-\tau_{t-1}\right)-\left(\tau_{t-1}-\tau_{t-2}\right)\right]^{2}$

The $\lambda$ value above handles variations in the growth rate of the trend component.
When analyzing quarterly data, the default lambda value of 1600 is recommended. Use 6.25 for annual data, and 129,600 for monthly data.

The following function extracts the cycle and trend component and returns it

Hodrick-Prescott filter

In [95]:
# Tuple unpacking
gdp_cycle, gdp_trend = hpfilter(df['realgdp'], lamb=1600)
#Both are a pandas.core.series.Series

Let's plot the results

In [96]:
gdp_cycle.plot();
In [97]:
gdp_trend.plot();
In [98]:
df['trend'] = gdp_trend
df[['trend','realgdp']].plot(figsize=(12,5)).autoscale(axis='x',tight=True);

ETS

Error/Trend/Seasonality Models

As we begin working with endogenous data ("endog" for short) and start to develop forecasting models, it helps to identify and isolate factors working within the system that influence behavior. Here the name "endogenous" considers internal factors, while "exogenous" would relate to external forces. These fall under the category of state space models, and include decomposition (described below), and exponential smoothing (described in an upcoming section).

The decomposition of a time series attempts to isolate individual components such as error, trend, and seasonality (ETS). We've already seen a simplistic example of this in the Introduction to Statsmodels section with the Hodrick-Prescott filter. There we separated data into a trendline and a cyclical feature that mapped observed data back to the trend.

Related Function:

statsmodels.tsa.seasonal.seasonal_decompose(x, model)   Seasonal decomposition using moving averages

For Further Reading:

Forecasting: Principles and Practice  Innovations state space models for exponential smoothing
Wikipedia  Decomposition of time series

Seasonal Decomposition

Statsmodels provides a seasonal decomposition tool we can use to separate out the different components. This lets us see quickly and visually what each component contributes to the overall behavior.

We apply an additive model when it seems that the trend is more linear and the seasonality and trend components seem to be constant over time (e.g. every year we add 10,000 passengers).
A multiplicative model is more appropriate when we are increasing (or decreasing) at a non-linear rate (e.g. each year we double the amount of passengers).

For these examples we'll use the International Airline Passengers dataset, which gives monthly totals in thousands from January 1949 to December 1960.

In [99]:
from statsmodels.tsa.seasonal import seasonal_decompose

airline = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/airline_passengers.csv',index_col='Month',parse_dates=True)

airline.dropna(inplace=True)

airline.plot();

The graph above shows clearle that a additive model fits better. The number of passangers grows with time

Using the seasonal_decompose from statsmodels we will decompose the time series into its components:

In [100]:
result = seasonal_decompose(airline['Thousands of Passengers'], model='multiplicative')  # model='mul' also works
result.plot();
In [101]:
#press on tab to check the different components
result.seasonal.plot();

SMA

Simple Moving Average

We've already shown how to create a simple moving average by applying a mean function to a rolling window.

For a quick review

In [102]:
airline['6-month-SMA'] = airline['Thousands of Passengers'].rolling(window=6).mean()
airline['12-month-SMA'] = airline['Thousands of Passengers'].rolling(window=12).mean()
In [103]:
airline.plot();

EWMA-Exponentially-Weighted-Moving-Average

We just showed how to calculate the SMA based on some window. However, basic SMA has some weaknesses:

  • Smaller windows will lead to more noise, rather than signal
  • It will always lag by the size of the window
  • It will never reach to full peak or valley of the data due to the averaging.
  • Does not really inform you about possible future behavior, all it really does is describe trends in your data.
  • Extreme historical values can skew your SMA significantly

To help fix some of these issues, we can use an EWMA (Exponentially weighted moving average).

EWMA will allow us to reduce the lag effect from SMA and it will put more weight on values that occured more recently (by applying more weight to the more recent values, thus the name). The amount of weight applied to the most recent values will depend on the actual parameters used in the EWMA and the number of periods given a window size. Full details on Mathematics behind this can be found here. Here is the shorter version of the explanation behind EWMA.

The formula for EWMA is:

$y_t = \frac{\sum\limits_{i=0}^t w_i x_{t-i}}{\sum\limits_{i=0}^t w_i}$

Where $x_t$ is the input value, $w_i$ is the applied weight (Note how it can change from $i=0$ to $t$), and $y_t$ is the output.

Now the question is, how do we define the weight term $w_i$?

This depends on the adjust parameter you provide to the .ewm() method.

When adjust=True (default) is used, weighted averages are calculated using weights equal to $w_i = (1 - \alpha)^i$ This is called Simple Exponential Smoothing which gives:

$y_t = \frac{xt + (1 - \alpha)x{t-1} + (1 - \alpha)^2 x_{t-2} + ...

  • (1 - \alpha)^t x_{0}}{1 + (1 - \alpha) + (1 - \alpha)^2 + ...
  • (1 - \alpha)^t}$

When adjust=False is specified, moving averages are calculated as:

$\begin{split}y_0 &= x_0 \

yt &= (1 - \alpha) y{t-1} + \alpha x_t,\end{split}$

which is equivalent to using weights:

\begin{split}w_i = \begin{cases} \alpha (1 - \alpha)^i & \text{if } i < t \\ (1 - \alpha)^i & \text{if } i = t. \end{cases}\end{split}

When adjust=True we have $y_0=x_0$ and from the last representation above we have $y_t=\alpha x_t+(1−α)y_{t−1}$, therefore there is an assumption that $x_0$ is not an ordinary value but rather an exponentially weighted moment of the infinite series up to that point.

The use of adjust=True or adjust=False, has to do whether a series is finite or infinite. In fact if you take an infinite series and pass it in the adjust=True formula, you will end up having the adjust=False formula.

For the smoothing factor $\alpha$ one must have $0<\alpha≤1$, and while it is possible to pass alpha directly, it’s often easier to think about either the span, center of mass (com) or half-life of an EW moment:

\begin{split}\alpha = \begin{cases} \frac{2}{s + 1}, & \text{for span}\ s \geq 1\\ \frac{1}{1 + c}, & \text{for center of mass}\ c \geq 0\\ 1 - \exp^{\frac{\log 0.5}{h}}, & \text{for half-life}\ h > 0 \end{cases}\end{split}

Those are the ways alpha can be calculated, but you can also just input it manually

  • Span corresponds to what is commonly called an “N-day EW moving average”. It's the window.
  • Center of mass has a more physical interpretation and can be thought of in terms of span: $c=(s−1)/2$
  • Half-life is the period of time for the exponential weight to reduce to one half.
  • Alpha specifies the smoothing factor directly.

We have to pass precisely one of the above into the .ewm() function. For our data we'll use span=12.

Disadvantages of EWMA:

- Does not take into account trend and seasonality. 
In [104]:
# With span = 12, we are using al alpha = 2/(12+1). Look at the above formula.
# Therefore when we pass the parameter span in the formula, we are setting alpha
airline['EWMA12'] = airline['Thousands of Passengers'].ewm(span=12,adjust=False).mean()
airline[['Thousands of Passengers','EWMA12']].plot();
In [105]:
airline['EWMA12'] = airline['Thousands of Passengers'].ewm(span=12,adjust=False).mean()
airline[['Thousands of Passengers','EWMA12']].plot();

SMA vs EWMA

In [106]:
airline[['Thousands of Passengers','EWMA12','12-month-SMA']].plot(figsize=(12,8)).autoscale(axis='x',tight=True);

Holt-Winters Methods

In the previous section on Exponentially Weighted Moving Averages (EWMA) we applied Simple Exponential Smoothing using just one smoothing factor $\alpha$ (alpha). This failed to account for other contributing factors like trend and seasonality.

In this section we'll look at Double and Triple Exponential Smoothing with the Holt-Winters Methods.

In Double Exponential Smoothing (double is because we have 2 parameters) (aka Holt's Method) we introduce a new smoothing factor $\beta$ (beta) that addresses trend :

\begin{split}l_t &= (1 - \alpha) l_{t-1} + \alpha x_t, & \text{ level}\\ b_t &= (1-\beta)b_{t-1} + \beta(l_t-l_{t-1}) & \text{ trend}\\ y_t &= l_t + b_t & \text{ fitted model}\\ \hat y_{t+h} &= l_t + hb_t & \text{ forecasting model (} h = \text{# periods into the future)}\end{split}

Because we haven't yet considered seasonal fluctuations, the forecasting model is simply a straight sloped line extending from the most recent data point. We'll see an example of this in upcoming lectures.

With Triple Exponential Smoothing (aka the Holt-Winters Method) we introduce a smoothing factor $\gamma$ (gamma) that addresses seasonality:

\begin{split}l_t &= (1 - \alpha) l_{t-1} + \alpha x_t, & \text{ level}\\ b_t &= (1-\beta)b_{t-1} + \beta(l_t-l_{t-1}) & \text{ trend}\\ c_t &= (1-\gamma)c_{t-L} + \gamma(x_t-l_{t-1}-b_{t-1}) & \text{ seasonal}\\ y_t &= (l_t + b_t) c_t & \text{ fitted model}\\ \hat y_{t+m} &= (l_t + mb_t)c_{t-L+1+(m-1)modL} & \text{ forecasting model (} m = \text{# periods into the future)}\end{split}

Here $L$ represents the number of divisions per cycle. In our case looking at monthly data that displays a repeating pattern each year, we would use $L=12$.

In general, higher values for $\alpha$, $\beta$ and $\gamma$ (values closer to 1), place more emphasis on recent data.

Related Functions:

statsmodels.tsa.holtwinters.SimpleExpSmoothing(endog)     Simple Exponential Smoothing
statsmodels.tsa.holtwinters.ExponentialSmoothing(endog)   Holt-Winters Exponential Smoothing

For Further Reading:

NIST/SEMATECH e-Handbook of Statistical Methods  What is Exponential Smoothing?

Simple Exponential Smoothing - Simple Moving Average

We worked on this above. Just as a reminder. A variation of the statmodels Holt-Winters function provides Simple Exponential Smoothing. We'll show that it performs the same calculation of the weighted moving average as the pandas .ewm() method:
$\begin{split}y_0 &= x_0 \\ y_t &= (1 - \alpha) y_{t-1} + \alpha x_t,\end{split}$

In [107]:
#We will keep using the arilines dataframe
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/airline_passengers.csv',index_col='Month',parse_dates=True)

df.dropna(inplace=True)

Note that our DatetimeIndex does not have a frequency. In order to build a Holt-Winters smoothing model, statsmodels needs to know the frequency of the data (whether it's daily, monthly etc.). Since observations occur at the start of each month, we'll use MS.
A full list of time series offset aliases can be found here.

In [108]:
df.index
Out[108]:
DatetimeIndex(['1949-01-01', '1949-02-01', '1949-03-01', '1949-04-01',
               '1949-05-01', '1949-06-01', '1949-07-01', '1949-08-01',
               '1949-09-01', '1949-10-01',
               ...
               '1960-03-01', '1960-04-01', '1960-05-01', '1960-06-01',
               '1960-07-01', '1960-08-01', '1960-09-01', '1960-10-01',
               '1960-11-01', '1960-12-01'],
              dtype='datetime64[ns]', name='Month', length=144, freq=None)
In [109]:
df.index.freq = 'MS'
df.index
Out[109]:
DatetimeIndex(['1949-01-01', '1949-02-01', '1949-03-01', '1949-04-01',
               '1949-05-01', '1949-06-01', '1949-07-01', '1949-08-01',
               '1949-09-01', '1949-10-01',
               ...
               '1960-03-01', '1960-04-01', '1960-05-01', '1960-06-01',
               '1960-07-01', '1960-08-01', '1960-09-01', '1960-10-01',
               '1960-11-01', '1960-12-01'],
              dtype='datetime64[ns]', name='Month', length=144, freq='MS')
In [110]:
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
span = 12
alpha = 2/(span+1)
#as we did in the previous cells
df['EWMA12'] = df['Thousands of Passengers'].ewm(alpha=alpha,adjust=False).mean()

Now we are going to implement the SIMPLE EXP SMOOTHING model, but first let's understand what is going on with the code:

In [111]:
model = SimpleExpSmoothing(df['Thousands of Passengers'])
In [112]:
# With Tab you can check the various attributes of the object "model", like "fit"
# As paremeters in the fit method we have the smoothing_level which is alpha
fitted_model =  model.fit(smoothing_level=alpha,optimized=False)

# This returns a HoltWintersResultsWrapper

fitted_model.fittedvalues
Out[112]:
Month
1949-01-01    112.000000
1949-02-01    112.000000
1949-03-01    112.923077
1949-04-01    115.857988
1949-05-01    117.879836
                 ...    
1960-08-01    474.698368
1960-09-01    494.898619
1960-10-01    496.914216
1960-11-01    491.388952
1960-12-01    475.790652
Freq: MS, Length: 144, dtype: float64
NOTE: For some reason, when optimized=False is passed into .fit(), the statsmodels SimpleExpSmoothing function shifts fitted values down one row. We fix this by adding .shift(-1) after .fittedvalues
In [113]:
# We will simply solve the problem with the shift function

fitted_model.fittedvalues.shift(-1)
Out[113]:
Month
1949-01-01    112.000000
1949-02-01    112.923077
1949-03-01    115.857988
1949-04-01    117.879836
1949-05-01    118.359861
                 ...    
1960-08-01    494.898619
1960-09-01    496.914216
1960-10-01    491.388952
1960-11-01    475.790652
1960-12-01           NaN
Freq: MS, Length: 144, dtype: float64
In [114]:
# we add the predictions to our dataframe

df['SES12'] = fitted_model.fittedvalues.shift(-1)
In [115]:
df.head()
Out[115]:
Thousands of Passengers EWMA12 SES12
Month
1949-01-01 112 112.000000 112.000000
1949-02-01 118 112.923077 112.923077
1949-03-01 132 115.857988 115.857988
1949-04-01 129 117.879836 117.879836
1949-05-01 121 118.359861 118.359861

Note that we could have simply put all the code in the cells into one line of code:

In [116]:
#df['SES12']=SimpleExpSmoothing(df['Thousands of Passengers']).fit(smoothing_level=alpha,optimized=False).fittedvalues.shift(-1)
In [117]:
df
Out[117]:
Thousands of Passengers EWMA12 SES12
Month
1949-01-01 112 112.000000 112.000000
1949-02-01 118 112.923077 112.923077
1949-03-01 132 115.857988 115.857988
1949-04-01 129 117.879836 117.879836
1949-05-01 121 118.359861 118.359861
... ... ... ...
1960-08-01 606 494.898619 494.898619
1960-09-01 508 496.914216 496.914216
1960-10-01 461 491.388952 491.388952
1960-11-01 390 475.790652 475.790652
1960-12-01 432 469.053629 NaN

144 rows × 3 columns

In [118]:
# EWMA12 and SES12 are the same in this particular case
df.plot(figsize=(14,6))
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ee5fcafd0>

Now that we have the Simple Exponential Smoothing, let's try the Double Exponential Smoothing, before we plot the results.


Double Exponential Smoothing - Holt's Method

Where Simple Exponential Smoothing employs just one smoothing factor $\alpha$ (alpha), Double Exponential Smoothing adds a second smoothing factor $\beta$ (beta) that addresses trends in the data. Like the alpha factor, values for the beta factor fall between zero and one ($0<\beta≤1$). The benefit here is that the model can anticipate future increases or decreases where the level model would only work from recent calculations.

We can also address different types of change (growth/decay) in the trend. If a time series displays a straight-line sloped trend, you would use an additive adjustment. If the time series displays an exponential (curved) trend, you would use a multiplicative adjustment.

As we move toward forecasting, it's worth noting that both additive and multiplicative adjustments may become exaggerated over time, and require damping that reduces the size of the trend over future periods until it reaches a flat line.

In [119]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

#It's hard to say wethet the passengers data has a multiplicative or additive trend, we will assume it is additive


df['DESadd12'] = ExponentialSmoothing(df['Thousands of Passengers'], trend='add').fit().fittedvalues.shift(-1)
df.head()
Out[119]:
Thousands of Passengers EWMA12 SES12 DESadd12
Month
1949-01-01 112 112.000000 112.000000 114.237773
1949-02-01 118 112.923077 112.923077 120.237773
1949-03-01 132 115.857988 115.857988 134.237773
1949-04-01 129 117.879836 117.879836 131.237773
1949-05-01 121 118.359861 118.359861 123.237773

The DESadd12 is almost the same as the number of passangers (the blue line is just behind the red)

In [120]:
df.plot(figsize=(12,5))
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ee5f564a8>
In [121]:
# Let's do some Zoom
df.iloc[:24].plot(figsize=(12,5))
Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ee44eb5c0>

Double exponential is clearly better than Simple

What about Tripple ?


Triple Exponential Smoothing - Holt-Winters Method

Triple Exponential Smoothing, the method most closely associated with Holt-Winters, adds support for both trends and seasonality in the data.

In [122]:
df['TESadd12'] = ExponentialSmoothing(df['Thousands of Passengers'],trend='add',seasonal='add',seasonal_periods=12).fit().fittedvalues
df.head()
/home/rangelrey/.local/share/virtualenvs/timeseriesanalysis-xa2U-udI/lib/python3.6/site-packages/statsmodels/tsa/holtwinters.py:744: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
  ConvergenceWarning)
Out[122]:
Thousands of Passengers EWMA12 SES12 DESadd12 TESadd12
Month
1949-01-01 112 112.000000 112.000000 114.237773 113.081288
1949-02-01 118 112.923077 112.923077 120.237773 120.550747
1949-03-01 132 115.857988 115.857988 134.237773 135.527329
1949-04-01 129 117.879836 117.879836 131.237773 133.155064
1949-05-01 121 118.359861 118.359861 123.237773 125.656114
In [123]:
df['TESmul12'] = ExponentialSmoothing(df['Thousands of Passengers'],trend='mul',seasonal='mul',seasonal_periods=12).fit().fittedvalues
df.head()
Out[123]:
Thousands of Passengers EWMA12 SES12 DESadd12 TESadd12 TESmul12
Month
1949-01-01 112 112.000000 112.000000 114.237773 113.081288 111.614584
1949-02-01 118 112.923077 112.923077 120.237773 120.550747 118.856901
1949-03-01 132 115.857988 115.857988 134.237773 135.527329 133.344660
1949-04-01 129 117.879836 117.879836 131.237773 133.155064 127.916273
1949-05-01 121 118.359861 118.359861 123.237773 125.656114 120.993591
In [124]:
df[['Thousands of Passengers','TESadd12','TESmul12']].plot(figsize=(12,6)).autoscale(axis='x',tight=True);
In [125]:
# Let's chekc for the first 2 years (24 months)
df[['Thousands of Passengers','TESadd12','TESmul12']].iloc[:24].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Forecasting

In the previous section we fit various smoothing models to existing data. So the purpose was not to forecast, but predict. The purpose now is to predict what happens next.
What's our best guess for next month's value? For the next six months?

In this section we'll look to extend our models into the future. First we'll divide known data into training and testing sets, and evaluate the performance of a trained model on known test data.

  • Goals
    • Compare a Holt-Winters forecasted model to known data
    • Understand stationarity, differencing and lagging
    • Introduce ARIMA and describe next steps

The Forecasting Procedure looks like:

  • Modle selection
  • Splitting data into train and test sets
  • Fit model on training set
  • Evaluate model on test set
  • Re-fit model on entire data set
  • Forecast for future data

Forecasting with the Holt-Winters Method

For this example we'll use the same airline_passengers dataset, and we'll split the data into 108 training records and 36 testing records. Then we'll evaluate the performance of the model.

In [126]:
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/airline_passengers.csv',index_col='Month',parse_dates=True)
df.index.freq = 'MS'
df.head()
Out[126]:
Thousands of Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121

Train Test Split

In [127]:
train_data = df.iloc[:109] # Goes up to but not including 109
test_data = df.iloc[108:]

Fitting the Model

In [128]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

fitted_model = ExponentialSmoothing(train_data['Thousands of Passengers'],trend='mul',seasonal='mul',seasonal_periods=12).fit()

Evaluating model against test

In [129]:
train_data['Thousands of Passengers'].plot(legend=True,label='TRAIN')
test_data['Thousands of Passengers'].plot(legend=True,label='TEST',figsize=(12,8));
In [130]:
test_predictions = fitted_model.forecast(36).rename('HW Forecast')
In [131]:
test_predictions.plot(legend=True, label="Prediction")
Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ee617de48>
In [132]:
train_data['Thousands of Passengers'].plot(legend=True,label='TRAIN')
test_data['Thousands of Passengers'].plot(legend=True,label='TEST',figsize=(12,8));
test_predictions.plot(legend=True, label="Prediction")
Out[132]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ee60c5358>

The prediction (green line) seems to be quite accurate compared to the test data. Remember, the test data is real.

Recall and accuracy are not appropiate evaluation metrics for forecasting techniques. We need metrics designed for continuous values

Evaluation Metrics

Let's analyse how can we use the most baic regression evaluation metrics

* Mean Absolute Error
* Mean Squared Error
* Root Mean Square Error
In [133]:
from sklearn.metrics import mean_squared_error,mean_absolute_error

Mean Absolute Error

image.png being y the true value and y hat the predicted value

Disadvantages: It does not take into account if a few predicted points are actually very far away from real points. Exmaple: imagine spending for a mktg campaign is very high in december (Xmas) and the rest of the months is similar, then the MAE might not show us the effect of that month, since it is a mean.

Therefore we have the MSE as a solution

In [134]:
mean_absolute_error(test_data, test_predictions)
Out[134]:
63.03016344745991

To analyse the MAE, you can check the summary key figures of the test data

In [135]:
test_data.describe()
Out[135]:
Thousands of Passengers
count 36.000000
mean 428.500000
std 79.329152
min 310.000000
25% 362.000000
50% 412.000000
75% 472.000000
max 622.000000

Mean Squared Error

image.png

Since it is squared, those predicted points that are very far away from the real ones will have more importance in the calculation. You punish the model by having large errors.

Of course this means we cannot interpret directly the result (since it is squared). So we have the RMSE for this

Root Mean Square Error

image.png

Squaring it we get the units back in its original form (like std with the variance). The intepretation dependends then on the data. A RMSE of 20€ of the price of a house is a very good, but for candy it's not.

In [136]:
np.sqrt(mean_squared_error(test_data,test_predictions))
Out[136]:
74.92727460739599

Our mean squarred error is less than our standard deviation. Therefore, the method is not doing a bad job

Forecasting Future Data

We will use the whole dataset now to predict the future values

In [137]:
#We creat the fitted model object
fitted_model = ExponentialSmoothing(train_data['Thousands of Passengers'],trend='mul',seasonal='mul',seasonal_periods=12).fit()

forecast_predictions= fitted_model.forecast(36)

df['Thousands of Passengers'].plot(figsize=(12,8))
forecast_predictions.plot();

In orange we have the predictions. Pretty accurate.

Stationarity

Time series data is said to be stationary if it does not exhibit trends or seasonality. That is, fluctuations in the data are entirely due to outside forces and noise. The file samples.csv contains made-up datasets that illustrate stationary and non-stationary data.

In [138]:
df2 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/samples.csv',index_col=0,parse_dates=True)
df2.head()
Out[138]:
a b c
1950-01-01 36 27 0
1950-02-01 58 22 3
1950-03-01 61 17 5
1950-04-01 37 15 8
1950-05-01 66 13 8
In [139]:
df2['a'].plot(ylim=[0,100],title="STATIONARY DATA").autoscale(axis='x',tight=True);
In [140]:
df2['b'].plot(ylim=[0,100],title="NON-STATIONARY DATA").autoscale(axis='x',tight=True);

Differencing

Non-stationary data can be made to look stationary through differencing. A simple differencing method calculates the difference between consecutive points.

Related Functions:

statespace.tools.diff(series[, k_diff, …])  Difference a series simply and/or seasonally along the zero-th axis.

You can calculate manually the first order difference by substracting:

In [141]:
df2["b"]-df2["b"].shift(1)
Out[141]:
1950-01-01     NaN
1950-02-01    -5.0
1950-03-01    -5.0
1950-04-01    -2.0
1950-05-01    -2.0
              ... 
1959-08-01     3.0
1959-09-01     4.0
1959-10-01    -7.0
1959-11-01    17.0
1959-12-01   -14.0
Name: b, Length: 120, dtype: float64

But statsmodels has a function that does it for you

In [142]:
from statsmodels.tsa.statespace.tools import diff
df2['d1'] = diff(df2['b'],k_diff=1)

df2['d1'].plot(title="FIRST DIFFERENCE DATA").autoscale(axis='x',tight=True);

Introduction to ARIMA Models

We'll investigate a variety of different forecasting models in upcoming sections, but they all stem from ARIMA.

ARIMA, or Autoregressive Integrated Moving Average is actually a combination of 3 models:

  • AR(p) Autoregression - a regression model that utilizes the dependent relationship between a current observation and observations over a previous period
  • I(d) Integration - uses differencing of observations (subtracting an observation from an observation at the previous time step) in order to make the time series stationary
  • MA(q) Moving Average - a model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

Moving Averages we've already seen with EWMA and the Holt-Winters Method.
Integration will apply differencing to make a time series stationary, which ARIMA requires.
Autoregression is explained in detail in the next section. Here we're going to correlate a current time series with a lagged version of the same series.
Once we understand the components, we'll investigate how to best choose the $p$, $d$ and $q$ values required by the model.

For times series the covariance (autocovariance) will be: ${\displaystyle \gamma_k = \frac 1 n \sum\limits_{t=1}^{n-k} (y_t - \bar{y})(y_{t+k}-\bar{y})}$

Autocovariance Example:

Say we have a time series with five observations: {13, 5, 11, 12, 9}.
We can quickly see that $n = 5$, the mean $\bar{y} = 10$, and we'll see that the variance $\sigma^2 = 8$.
The following calculations give us our covariance values:

$\gamma_0 = \frac {(13-10)(13-10)+(5-10)(5-10)+(11-10)(11-10)+(12-10)(12-10)+(9-10)(9-10)} 5 = \frac {40} 5 = 8.0 \\ \gamma_1 = \frac {(13-10)(5-10)+(5-10)(11-10)+(11-10)(12-10)+(12-10)(9-10)} 5 = \frac {-20} 5 = -4.0 \\ \gamma_2 = \frac {(13-10)(11-10)+(5-10)(12-10)+(11-10)(9-10)} 5 = \frac {-8} 5 = -1.6 \\ \gamma_3 = \frac {(13-10)(12-10)+(5-10)(9-10)} 5 = \frac {11} 5 = 2.2 \\ \gamma_4 = \frac {(13-10)(9-10)} 5 = \frac {-3} 5 = -0.6$

Note that $\gamma_0$ is just the population variance $\sigma^2$

Let's see if statsmodels gives us the same results! For this we'll create a fake dataset:

In [143]:
import pandas as pd
import numpy as np
%matplotlib inline
import statsmodels.api as sm
# Import the models we'll be using in this section
from statsmodels.tsa.stattools import acovf,acf,pacf,pacf_yw,pacf_ols
from pandas.plotting import lag_plot


import warnings
warnings.filterwarnings("ignore")
df = pd.DataFrame({'a':[13, 5, 11, 12, 9]})
arr = acovf(df['a'])
arr
Out[143]:
array([ 8. , -4. , -1.6,  2.2, -0.6])

Unbiased Autocovariance

Note that the number of terms in the calculations above are decreasing.
Statsmodels can return an "unbiased" autocovariance where instead of dividing by $n$ we divide by $n-k$.

$\gamma_0 = \frac {(13-10)(13-10)+(5-10)(5-10)+(11-10)(11-10)+(12-10)(12-10)+(9-10)(9-10)} {5-0} = \frac {40} 5 = 8.0 \\ \gamma_1 = \frac {(13-10)(5-10)+(5-10)(11-10)+(11-10)(12-10)+(12-10)(9-10)} {5-1} = \frac {-20} 4 = -5.0 \\ \gamma_2 = \frac {(13-10)(11-10)+(5-10)(12-10)+(11-10)(9-10)} {5-2} = \frac {-8} 3 = -2.67 \\ \gamma_3 = \frac {(13-10)(12-10)+(5-10)(9-10)} {5-3} = \frac {11} 2 = 5.5 \\ \gamma_4 = \frac {(13-10)(9-10)} {5-4} = \frac {-3} 1 = -3.0$

In [144]:
arr2 = acovf(df['a'],unbiased=True)
arr2
Out[144]:
array([ 8.        , -5.        , -2.66666667,  5.5       , -3.        ])

Autocorrelation for 1D

The correlation $\rho$ (rho) between two variables $y_1,y_2$ is given as:

$\rho = \frac {\operatorname E[(y_1−\mu_1)(y_2−\mu_2)]} {\sigma_{1}\sigma_{2}} = \frac {\operatorname {Cov} (y_1,y_2)} {\sigma_{1}\sigma_{2}}$,

where $E$ is the expectation operator, $\mu_{1},\sigma_{1}$ and $\mu_{2},\sigma_{2}$ are the means and standard deviations of $y_1$ and $y_2$.

When working with a single variable (i.e. autocorrelation) we would consider $y_1$ to be the original series and $y_2$ a lagged version of it. Note that with autocorrelation we work with $\bar y$, that is, the full population mean, and not the means of the reduced set of lagged factors (see note below).

Thus, the formula for $\rho_k$ for a time series at lag $k$ is:

${\displaystyle \rho_k = \frac {\sum\limits_{t=1}^{n-k} (y_t - \bar{y})(y_{t+k}-\bar{y})} {\sum\limits_{t=1}^{n} (y_t - \bar{y})^2}}$

This can be written in terms of the covariance constant $\gamma_k$ as:

${\displaystyle \rho_k = \frac {\gamma_k n} {\gamma_0 n} = \frac {\gamma_k} {\sigma^2}}$

For example,
$\rho_4 = \frac {\gamma_4} {\sigma^2} = \frac{-0.6} {8} = -0.075$

Note that ACF values are bound by -1 and 1. That is, ${\displaystyle -1 \leq \rho_k \leq 1}$

In [145]:
arr3 = acf(df['a'])
arr3
Out[145]:
array([ 1.   , -0.5  , -0.2  ,  0.275, -0.075])

Partial Autocorrelation

Partial autocorrelations measure the linear dependence of one variable after removing the effect of other variable(s) that affect both variables. That is, the partial autocorrelation at lag $k$ is the autocorrelation between $y_t$ and $y_{t+k}$ that is not accounted for by lags $1$ through $k−1$.

A common method employs the non-recursive Yule-Walker Equations:

$\phi_0 = 1\\ \phi_1 = \rho_1 = -0.50\\ \phi_2 = \frac {\rho_2 - {\rho_1}^2} {1-{\rho_1}^2} = \frac {(-0.20) - {(-0.50)}^2} {1-{(-0.50)}^2}= \frac {-0.45} {0.75} = -0.60$

As $k$ increases, we can solve for $\phi_k$ using matrix algebra and the Levinson–Durbin recursion algorithm which maps the sample autocorrelations $\rho$ to a Toeplitz diagonal-constant matrix. The full solution is beyond the scope of this course, but the setup is as follows:

$\displaystyle \begin{pmatrix}\rho_0&\rho_1&\cdots &\rho_{k-1}\\ \rho_1&\rho_0&\cdots &\rho_{k-2}\\ \vdots &\vdots &\ddots &\vdots \\ \rho_{k-1}&\rho_{k-2}&\cdots &\rho_0\\ \end{pmatrix}\quad \begin{pmatrix}\phi_{k1}\\\phi_{k2}\\\vdots\\\phi_{kk}\end{pmatrix} \mathbf = \begin{pmatrix}\rho_1\\\rho_2\\\vdots\\\rho_k\end{pmatrix}$

In [146]:
# it's 4 because we have 4 lags
# mle stands for maximum likelihood estimation --> this uses the biased estimated coefficients
# yw stands for yule-walker equations
arr4 = pacf_yw(df['a'],nlags=4,method='mle')
arr4
Out[146]:
array([ 1.        , -0.5       , -0.6       , -0.38541667, -0.40563273])
NOTE: We passed in method='mle' above in order to use biased ACF coefficients. "mle" stands for "maximum likelihood estimation". Alternatively we can pass method='unbiased' (the statsmodels default):
In [147]:
arr5 = pacf_yw(df['a'],nlags=4,method='unbiased')  
arr5
Out[147]:
array([ 1.        , -0.625     , -1.18803419,  2.03764205,  0.8949589 ])

instead of YW you can use as well OLS

In [148]:
arr6 = pacf_ols(df['a'],nlags=4)
arr6
Out[148]:
array([ 1.        , -0.49677419, -0.43181818,  0.53082621,  0.25434783])

Plotting

The arrays returned by .acf(df) and .pacf_yw(df) show the magnitude of the autocorrelation for a given $y$ at time $t$. Before we look at plotting arrays, let's look at the data itself for evidence of autocorrelation.

Pandas has a built-in plotting function that plots increasing $y_t$ values on the horizontal axis against lagged versions of the values $y_{t+1}$ on the vertical axis. If a dataset is non-stationary with an upward trend, then neighboring values should trend in the same way. Let's look at the Airline Passengers dataset first.

In [149]:
# Load a non-stationary dataset
df1 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/airline_passengers.csv',index_col='Month',parse_dates=True)
df1.index.freq = 'MS'

# Load a stationary dataset
df2 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/DailyTotalFemaleBirths.csv',index_col='Date',parse_dates=True)
df2.index.freq = 'D'

Let's look at a lag plot of the non-stationary data set (the airline)

In [150]:
lag_plot(df1['Thousands of Passengers']);

This shows evidence of very strong autocorrelation

Let's look at a lag plot of the stationary data set (the births)

In [151]:
lag_plot(df2['Births']);

There is no autocorrelation. The number of births of today is not correlated with the number of births from yesterday

ACF Plots

Plotting the magnitude of the autocorrelations over the first few (20-40) lags can say a lot about a time series.

For example, consider the stationary Daily Total Female Births dataset:

In [152]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
In [153]:
# Now let's plot the autocorrelation at different lags
title = 'Autocorrelation: Airline Passengers'
lags = 40
plot_acf(df1,title=title,lags=lags);

There is clearly signs os seasonality in the data.

The shaded region is a 95% confidence interval. Which means that those points (those lags) outside the blue shaded area are more like to be correlated with the current year. While the higher the lag, the less likely is that there is signiciant correlation

In [173]:
title = 'Autocorrelation: Daily Female Births'
lags = 40
plot_acf(df2,title=title,lags=lags);

AR(p)

Autoregressive Model

In a moving average model as we saw with Holt-Winters, we forecast the variable of interest using a linear combination of predictors. In our example we forecasted numbers of airline passengers in thousands based on a set of level, trend and seasonal predictors.

In an autoregression model, we forecast using a linear combination of past values of the variable. The term autoregression describes a regression of the variable against itself. An autoregression is run against a set of lagged values of order $p$.

$y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \varepsilon_{t}$

where $c$ is a constant, $\phi_{1}$ and $\phi_{2}$ are lag coefficients up to order $p$, and $\varepsilon_{t}$ is white noise.

For example, an AR(1) model would follow the formula

    $y_{t} = c + \phi_{1}y_{t-1} + \varepsilon_{t}$

whereas an AR(2) model would follow the formula

    $y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \varepsilon_{t}$

and so on.

Note that the lag coeffients are usually less than one, as we usually restrict autoregressive models to stationary data.
Specifically, for an AR(1) model: $-1 \lt \phi_1 \lt 1$
and for an AR(2) model: $-1 \lt \phi_2 \lt 1, \ \phi_1 + \phi_2 \lt 1, \ \phi_2 - \phi_1 \lt 1$

Models AR(3) and higher become mathematically very complex. Fortunately statsmodels does all the heavy lifting for us.

Related Functions:

ar_model.AR(endog[, dates, freq, missing])  Autoregressive AR(p) model
ar_model.ARResults(model, params[, …])      Class to hold results from fitting an AR model

For Further Reading:

Forecasting: Principles and Practice  Autoregressive models
Wikipedia  Autoregressive model

Perform standard imports and load datasets

For this exercise we'll look at monthly U.S. population estimates in thousands from January 2011 to December 2018 (96 records, 8 years of data). Population includes resident population plus armed forces overseas. The monthly estimate is the average of estimates for the first of the month and the first of the following month. Source: https://fred.stlouisfed.org/series/POPTHM

In [155]:
# Load specific forecasting tools
from statsmodels.tsa.ar_model import AR,ARResults

# Load the U.S. Population dataset
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/uspopulation.csv',index_col='DATE',parse_dates=True)
df.index.freq = 'MS'
In [156]:
title='U.S. Monthly Population Estimates'
ylabel='Pop. Est. (thousands)'
xlabel='' # we don't really need a label here

ax = df['PopEst'].plot(figsize=(12,5),title=title);
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

Split the data into train/test sets

The goal in this section is to:

  • Split known data into a training set of records on which to fit the model
  • Use the remaining records for testing, to evaluate the model
  • Fit the model again on the full set of records
  • Predict a future set of values using the model

As a general rule you should set the length of your test set equal to your intended forecast size. That is, for a monthly dataset you might want to forecast out one more year. Therefore your test set should be one year long.

NOTE: For many training and testing applications we would use the train_test_split() function available from Python's scikit-learn library. This won't work here as train_test_split() takes random samples of data from the population.

Using the previous month to predict is not perfect, but still we get an acceptable result

In [173]:
# Set one year for testing
train = df.iloc[:84]
test = df.iloc[84:]
In [175]:
model = AR(train['PopEst'])
AR1fit = model.fit(maxlag=1)
print(f'Lag: {AR1fit.k_ar}')
print(f'Coefficients:\n{AR1fit.params}')

# This is the general format for obtaining predictions
start=len(train)
end=len(train)+len(test)-1
predictions1 = AR1fit.predict(start=start, end=end, dynamic=False).rename('AR(1) Predictions')

test['PopEst'].plot(legend=True)
predictions1.plot(legend=True,figsize=(12,6));
Lag: 1
Coefficients:
const        284.913797
L1.PopEst      0.999686
dtype: float64

AR(2)

In [179]:
# Recall that our model was already created above based on the training set
model2 = AR(train['PopEst'])
AR2fit = model2.fit(maxlag=2)
print(f'Lag: {AR2fit.k_ar}')
print(f'Coefficients:\n{AR2fit.params}')

start=len(train)
end=len(train)+len(test)-1
predictions2 = AR2fit.predict(start=start, end=end, dynamic=False).rename('AR(2) Predictions')

test['PopEst'].plot(legend=True)
predictions1.plot(legend=True)
predictions2.plot(legend=True,figsize=(12,6));
Lag: 2
Coefficients:
const        137.368305
L1.PopEst      1.853490
L2.PopEst     -0.853836
dtype: float64

How do we actually find the best number of lags? We can optimize for "p" (number of lags)

If we leave the maxlg empty, statsmodels tries to find the optimal value.

In [184]:
modelp = AR(train['PopEst'])

ARfit = modelp.fit()
print(f'Lag: {ARfit.k_ar}')
print(f'Coefficients:\n{ARfit.params}')

predictionsp = ARfit.predict(start=start, end=end, dynamic=False).rename('AR(p) Predictions')

test['PopEst'].plot(legend=True)
predictions1.plot(legend=True)
predictions2.plot(legend=True)
predictionsp.plot(legend=True,figsize=(12,6));
Lag: 11
Coefficients:
const         96.143523
L1.PopEst      2.298626
L2.PopEst     -2.027441
L3.PopEst      1.356878
L4.PopEst     -1.171630
L5.PopEst      0.816723
L6.PopEst     -0.699990
L7.PopEst      0.727004
L8.PopEst     -0.354790
L9.PopEst      0.241036
L10.PopEst    -0.179865
L11.PopEst    -0.006819
dtype: float64

But we can see that it actually has similar results as AR(2)

What we can do is to change the criterion that statsmodels uses to determine p. We will use the t-stat criterion

In [187]:
modelpp = AR(train['PopEst'])

ARfit = modelpp.fit(ic='t-stat')
print(f'Lag: {ARfit.k_ar}')
print(f'Coefficients:\n{ARfit.params}')

predictionspp = ARfit.predict(start=start, end=end, dynamic=False).rename('AR(pp) Predictions')

test['PopEst'].plot(legend=True)
predictionsp.plot(legend=True)
predictionspp.plot(legend=True,figsize=(12,6));
Lag: 8
Coefficients:
const        82.309677
L1.PopEst     2.437997
L2.PopEst    -2.302100
L3.PopEst     1.565427
L4.PopEst    -1.431211
L5.PopEst     1.125022
L6.PopEst    -0.919494
L7.PopEst     0.963694
L8.PopEst    -0.439511
dtype: float64

That looks much better. But let's look at the mean squared error to see which one has a lowe error

In [193]:
preds = [predictions1, predictions2, predictionsp, predictionspp]
labels = ['AR(1)','AR(2)','AR(p)', 'AR(pp)']
for i in range(len(preds)):
    error = mean_squared_error(test['PopEst'],preds[i])
    print(f'{labels[i]} MSE was :{error}')
AR(1) MSE was :17449.714238785422
AR(2) MSE was :2713.258528140898
AR(p) MSE was :3206.162817657509
AR(pp) MSE was :186.97397472611956

The last model, which used the t-stat as criterion is the one with the lower MSE

Forecasting

Now we're ready to train our best model on the greatest amount of data, and fit it to future dates. Using the last model

In [194]:
# First, retrain the model on the full dataset
model = AR(df['PopEst'])

# Next, fit the model
ARfit = model.fit(ic='t-stat')

# Make predictions
fcast = ARfit.predict(start=len(df), end=len(df)+12, dynamic=False).rename('Forecast')

# Plot the results
df['PopEst'].plot(legend=True)
fcast.plot(legend=True,figsize=(12,6));

Descriptive Statistics and Tests

In upcoming sections we'll talk about different forecasting models like ARMA, ARIMA, Seasonal ARIMA and others. Each model addresses a different type of time series. For this reason, in order to select an appropriate model we need to know something about the data.

In this section we'll learn how to determine if a time series is stationary, if it's independent, and if two series demonstrate correlation and/or causality.

  • Goals
    • Be able to perform Augmented Dickey Fuller Test
    • Kwiatkowski-Phillips-Schmidt-Shin test for stationarity.
    • Calculate the BDS test statistic for independence of a time series
    • Return’s Ljung-Box Q Statistic
    • four tests for granger non-causality of 2 timeseries (maybe do this tests on two airline stocks against each other, or gas price versus airline stock/travel costs)

Related Functions:

stattools.ccovf(x, y[, unbiased, demean])  crosscovariance for 1D
stattools.ccf(x, y[, unbiased])            cross-correlation function for 1d
stattools.periodogram(X)                   Returns the periodogram for the natural frequency of X
stattools.adfuller(x[, maxlag, regression, …])  Augmented Dickey-Fuller unit root test
stattools.kpss(x[, regression, lags, store])    Kwiatkowski-Phillips-Schmidt-Shin test for stationarity.
stattools.coint(y0, y1[, trend, method, …])     Test for no-cointegration of a univariate equation
stattools.bds(x[, max_dim, epsilon, distance])  Calculate the BDS test statistic for independence of a time series
stattools.q_stat(x, nobs[, type])               Returns Ljung-Box Q Statistic
stattools.grangercausalitytests(x, maxlag[, …]) Four tests for granger non-causality of 2 timeseries
stattools.levinson_durbin(s[, nlags, isacov])   Levinson-Durbin recursion for autoregressive processes
stattools.eval_measures.mse(x1, x2, axis=0)      mean squared error
stattools.eval_measures.rmse(x1, x2, axis=0)     root mean squared error
stattools.eval_measures.meanabs(x1, x2, axis=0)  mean absolute error

For Further Reading:

Wikipedia:  Augmented Dickey–Fuller test
Forecasting: Principles and Practice:  Evaluating forecast accuracy
In [195]:
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

# Load a seasonal dataset
df1 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/airline_passengers.csv',index_col='Month',parse_dates=True)
df1.index.freq = 'MS'

# Load a nonseasonal dataset
df2 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/DailyTotalFemaleBirths.csv',index_col='Date',parse_dates=True)
df2.index.freq = 'D'

from statsmodels.tsa.stattools import ccovf,ccf,periodogram

from statsmodels.tsa.stattools import adfuller,kpss,coint,bds,q_stat,grangercausalitytests,levinson_durbin

from statsmodels.tools.eval_measures import mse, rmse, meanabs

Tests for Stationarity

A time series is stationary if the mean and variance are fixed between any two equidistant points. That is, no matter where you take your observations, the results should be the same. A times series that shows seasonality is not stationary.

A test for stationarity usually involves a unit root hypothesis test, where the null hypothesis $H_0$ is that the series is nonstationary, and contains a unit root. The alternate hypothesis $H_1$ supports stationarity. The augmented Dickey-Fuller Test is one such test.

Augmented Dickey-Fuller Test

To determine whether a series is stationary we can use the augmented Dickey-Fuller Test. In this test the null hypothesis states that $\phi = 1$ (this is also called a unit test). The test returns several statistics we'll see in a moment. Our focus is on the p-value. A small p-value ($p<0.05$) indicates strong evidence against the null hypothesis.

To demonstrate, we'll use a dataset we know is not stationary, the airline_passenger dataset. First, let's plot the data along with a 12-month rolling mean and standard deviation:

In [196]:
df1['12-month-SMA'] = df1['Thousands of Passengers'].rolling(window=12).mean()
df1['12-month-Std'] = df1['Thousands of Passengers'].rolling(window=12).std()

df1[['Thousands of Passengers','12-month-SMA','12-month-Std']].plot();

Not only is this dataset seasonal with a clear upward trend, the standard deviation increases over time as well.

In [197]:
print('Augmented Dickey-Fuller Test on Airline Data')
# Note that autolag = AIC, Akaike Information Criteria method
dftest = adfuller(df1['Thousands of Passengers'],autolag='AIC')
dftest
Augmented Dickey-Fuller Test on Airline Data
Out[197]:
(0.8153688792060465,
 0.991880243437641,
 13,
 130,
 {'1%': -3.4816817173418295,
  '5%': -2.8840418343195267,
  '10%': -2.578770059171598},
 996.692930839019)

To understand what those values mean, you can also check more information with the help() function

In [200]:
help(adfuller)
Help on function adfuller in module statsmodels.tsa.stattools:

adfuller(x, maxlag=None, regression='c', autolag='AIC', store=False, regresults=False)
    Augmented Dickey-Fuller unit root test.
    
    The Augmented Dickey-Fuller test can be used to test for a unit root in a
    univariate process in the presence of serial correlation.
    
    Parameters
    ----------
    x : array_like, 1d
        The data series to test.
    maxlag : int
        Maximum lag which is included in test, default 12*(nobs/100)^{1/4}.
    regression : {'c','ct','ctt','nc'}
        Constant and trend order to include in regression.
    
        * 'c' : constant only (default).
        * 'ct' : constant and trend.
        * 'ctt' : constant, and linear and quadratic trend.
        * 'nc' : no constant, no trend.
    
    autolag : {'AIC', 'BIC', 't-stat', None}
        Method to use when automatically determining the lag.
    
        * if None, then maxlag lags are used.
        * if 'AIC' (default) or 'BIC', then the number of lags is chosen
          to minimize the corresponding information criterion.
        * 't-stat' based choice of maxlag.  Starts with maxlag and drops a
          lag until the t-statistic on the last lag length is significant
          using a 5%-sized test.
    store : bool
        If True, then a result instance is returned additionally to
        the adf statistic. Default is False.
    regresults : bool, optional
        If True, the full regression results are returned. Default is False.
    
    Returns
    -------
    adf : float
        The test statistic.
    pvalue : float
        MacKinnon's approximate p-value based on MacKinnon (1994, 2010).
    usedlag : int
        The number of lags used.
    nobs : int
        The number of observations used for the ADF regression and calculation
        of the critical values.
    critical values : dict
        Critical values for the test statistic at the 1 %, 5 %, and 10 %
        levels. Based on MacKinnon (2010).
    icbest : float
        The maximized information criterion if autolag is not None.
    resstore : ResultStore, optional
        A dummy class with results attached as attributes.
    
    Notes
    -----
    The null hypothesis of the Augmented Dickey-Fuller is that there is a unit
    root, with the alternative that there is no unit root. If the pvalue is
    above a critical size, then we cannot reject that there is a unit root.
    
    The p-values are obtained through regression surface approximation from
    MacKinnon 1994, but using the updated 2010 tables. If the p-value is close
    to significant, then the critical values should be used to judge whether
    to reject the null.
    
    The autolag option and maxlag for it are described in Greene.
    
    References
    ----------
    .. [1] W. Green.  "Econometric Analysis," 5th ed., Pearson, 2003.
    
    .. [2] Hamilton, J.D.  "Time Series Analysis".  Princeton, 1994.
    
    .. [3] MacKinnon, J.G. 1994.  "Approximate asymptotic distribution functions for
        unit-root and cointegration tests.  `Journal of Business and Economic
        Statistics` 12, 167-76.
    
    .. [4] MacKinnon, J.G. 2010. "Critical Values for Cointegration Tests."  Queen's
        University, Dept of Economics, Working Papers.  Available at
        http://ideas.repec.org/p/qed/wpaper/1227.html
    
    Examples
    --------
    See example notebook

Since this is a bit annoying, we can customize it for our own use

In [201]:
print('Augmented Dickey-Fuller Test on Airline Data')

dfout = pd.Series(dftest[0:4],index=['ADF test statistic','p-value','# lags used','# observations'])

for key,val in dftest[4].items():
    dfout[f'critical value ({key})']=val
print(dfout)
Augmented Dickey-Fuller Test on Airline Data
ADF test statistic        0.815369
p-value                   0.991880
# lags used              13.000000
# observations          130.000000
critical value (1%)      -3.481682
critical value (5%)      -2.884042
critical value (10%)     -2.578770
dtype: float64

Here we have a very high p-value at 0.99, which provides weak evidence against the null hypothesis, and so we fail to reject the null hypothesis, and decide that our dataset is not stationary.
Note: in statistics we don't "accept" a null hypothesis - nothing is ever truly proven - we just fail to reject it.

Now let's apply the ADF test to stationary data with the Daily Total Female Births dataset.

In [202]:
df2['30-Day-SMA'] = df2['Births'].rolling(window=30).mean()
df2['30-Day-Std'] = df2['Births'].rolling(window=30).std()

df2[['Births','30-Day-SMA','30-Day-Std']].plot();
In [203]:
print('Augmented Dickey-Fuller Test on Daily Female Births')
dftest = adfuller(df2['Births'],autolag='AIC')
dfout = pd.Series(dftest[0:4],index=['ADF test statistic','p-value','# lags used','# observations'])

for key,val in dftest[4].items():
    dfout[f'critical value ({key})']=val
print(dfout)
Augmented Dickey-Fuller Test on Daily Female Births
ADF test statistic       -4.808291
p-value                   0.000052
# lags used               6.000000
# observations          358.000000
critical value (1%)      -3.448749
critical value (5%)      -2.869647
critical value (10%)     -2.571089
dtype: float64

In this case our p-value is very low at 0.000052, and we do reject the null hypothesis. This dataset appears to have no unit root, and is stationary.

Function for running the augmented Dickey-Fuller test

Since we'll use it frequently in the upcoming forecasts, let's define a function we can copy into future notebooks for running the augmented Dickey-Fuller test. Remember that we'll still have to import adfuller at the top of our notebook.

In [17]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series,title=''):
    """
    Pass in a time series and an optional title, returns an ADF report
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(),autolag='AIC') # .dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    out = pd.Series(result[0:4],index=labels)

    for key,val in result[4].items():
        out[f'critical value ({key})']=val
        
    print(out.to_string())          # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")

Let's run this function:

In [206]:
adf_test(df1['Thousands of Passengers'])
Augmented Dickey-Fuller Test: 
ADF test statistic        0.815369
p-value                   0.991880
# lags used              13.000000
# observations          130.000000
critical value (1%)      -3.481682
critical value (5%)      -2.884042
critical value (10%)     -2.578770
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary
In [207]:
adf_test(df2['Births'])
Augmented Dickey-Fuller Test: 
ADF test statistic       -4.808291
p-value                   0.000052
# lags used               6.000000
# observations          358.000000
critical value (1%)      -3.448749
critical value (5%)      -2.869647
critical value (10%)     -2.571089
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Granger Causality Tests

The Granger Cauality test tells us wether a series can be considered to be able to forecast another series. It will test it with various tests such as F-test and chi2-test

In [ ]:
df3 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/samples.csv',index_col=0, parse_dates=True)
df3.index.freq = 'MS'
df3.head()
In [218]:
grangercausalitytests(df3[['a','c']],maxlag=3)
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.1292  , p=0.7199  , df_denom=116, df_num=1
ssr based chi2 test:   chi2=0.1325  , p=0.7158  , df=1
likelihood ratio test: chi2=0.1325  , p=0.7159  , df=1
parameter F test:         F=0.1292  , p=0.7199  , df_denom=116, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=0.8111  , p=0.4469  , df_denom=113, df_num=2
ssr based chi2 test:   chi2=1.6940  , p=0.4287  , df=2
likelihood ratio test: chi2=1.6820  , p=0.4313  , df=2
parameter F test:         F=0.8111  , p=0.4469  , df_denom=113, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=0.4467  , p=0.7201  , df_denom=110, df_num=3
ssr based chi2 test:   chi2=1.4255  , p=0.6996  , df=3
likelihood ratio test: chi2=1.4169  , p=0.7016  , df=3
parameter F test:         F=0.4467  , p=0.7201  , df_denom=110, df_num=3
Out[218]:
{1: ({'ssr_ftest': (0.12918575915661892, 0.7199312293830932, 116.0, 1),
   'ssr_chi2test': (0.13252677016929013, 0.715826449244265, 1),
   'lrtest': (0.13245302934637948, 0.7159020901722554, 1),
   'params_ftest': (0.12918575915736052, 0.7199312293823164, 116.0, 1.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f4edac562b0>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f4edac56470>,
   array([[0., 1., 0.]])]),
 2: ({'ssr_ftest': (0.8111078932183546, 0.44693552581960305, 113.0, 2),
   'ssr_chi2test': (1.6939952460135548, 0.42870012105264843, 2),
   'lrtest': (1.6819509736778855, 0.43128960083084733, 2),
   'params_ftest': (0.8111078932163238, 0.4469355258204989, 113.0, 2.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f4edac56c18>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f4edac56eb8>,
   array([[0., 0., 1., 0., 0.],
          [0., 0., 0., 1., 0.]])]),
 3: ({'ssr_ftest': (0.44673882003616905, 0.7200959797468032, 110.0, 3),
   'ssr_chi2test': (1.425502962115412, 0.6995677345603015, 3),
   'lrtest': (1.4168888483621913, 0.7015807104329939, 3),
   'params_ftest': (0.44673882005686966, 0.7200959797322368, 110.0, 3.0)},
  [<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f4edac567f0>,
   <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f4edac56cc0>,
   array([[0., 0., 0., 1., 0., 0., 0.],
          [0., 0., 0., 0., 1., 0., 0.],
          [0., 0., 0., 0., 0., 1., 0.]])])}

We need to interpret de p-values and since all of them are >0.05, we can say there is no time-causality in the time series

Evaluating forecast accuracy

Two calculations related to linear regression are mean squared error (MSE) and root mean squared error (RMSE)

The formula for the mean squared error is

    $MSE = {\frac 1 L} \sum\limits_{l=1}^L (y_{T+l} - \hat y_{T+l})^2$

where $T$ is the last observation period and $l$ is the lag point up to $L$ number of test observations.

The formula for the root mean squared error is

    $RMSE = \sqrt{MSE} = \sqrt{{\frac 1 L} \sum\limits_{l=1}^L (y_{T+l} - \hat y_{T+l})^2}$

The advantage of the RMSE is that it is expressed in the same units as the data.

A method similar to the RMSE is the mean absolute error (MAE) which is the mean of the magnitudes of the error, given as

    $MAE = {\frac 1 L} \sum\limits_{l=1}^L \mid{y_{T+l}} - \hat y_{T+l}\mid$

A forecast method that minimizes the MAE will lead to forecasts of the median, while minimizing the RMSE will lead to forecasts of the mean.

let's create some random data

In [219]:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(20,30,(50,2)),columns=['test','predictions'])
df.plot(figsize=(12,4));
In [220]:
df.head()
Out[220]:
test predictions
0 26 23
1 27 24
2 26 29
3 22 26
4 27 24
In [221]:
from statsmodels.tools.eval_measures import mse,rmse,meanabs

Let's use the most used metric for evaluating prediction accuracy

In [222]:
rmse(df['test'],df['predictions'])
Out[222]:
4.125530268947253
In [224]:
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/airline_passengers.csv',index_col='Month',parse_dates=True)
df.index.freq = 'MS'

let's take the data set and check monthly data

In [226]:
from statsmodels.graphics.tsaplots import month_plot, quarter_plot

month_plot(df['Thousands of Passengers']);

This displays the range of values and their means (from the whole data series).

To check the quarter data, we will use:

In [230]:
dfq = df["Thousands of Passengers"].resample(rule='Q').mean()
quarter_plot(dfq);

Choosing ARIMA Orders

  • Goals
    • Understand PDQ terms for ARIMA (slides)
    • Understand how to choose orders manually from ACF and PACF
    • Understand how to use automatic order selection techniques using the functions below

Before we can apply an ARIMA forecasting model, we need to review the components of one.
ARIMA, or Autoregressive Independent Moving Average is actually a combination of 3 models:

  • AR(p) Autoregression - a regression model that utilizes the dependent relationship between a current observation and observations over a previous period.
  • I(d) Integration - uses differencing of observations (subtracting an observation from an observation at the previous time step) in order to make the time series stationary. In other words how many times we need to difference the data to get it stationary so the AR and MA components can work.
  • MA(q) Moving Average - a model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.It indicates that the regressions error is a linear combination of error terms. We will set up another regression model that focuses on the residual term between a moving average and the real values

image.png

Related Functions:

pmdarima.auto_arima(y[,start_p,d,start_q, …])   Returns the optimal order for an ARIMA model

Optional Function (see note below):

stattools.arma_order_select_ic(y[, max_ar, …])  Returns information criteria for many ARMA models
x13.x13_arima_select_order(endog[, …])          Perform automatic seasonal ARIMA order identification using x12/x13 ARIMA
In [5]:
# Load a non-stationary dataset
df1 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/airline_passengers.csv',index_col='Month',parse_dates=True)
df1.index.freq = 'MS'

# Load a stationary dataset
df2 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/DailyTotalFemaleBirths.csv',index_col='Date',parse_dates=True)
df2.index.freq = 'D'
In [6]:
from pmdarima import auto_arima

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
In [7]:
help(auto_arima)
Help on function auto_arima in module pmdarima.arima.auto:

auto_arima(y, exogenous=None, start_p=2, d=None, start_q=2, max_p=5, max_d=2, max_q=5, start_P=1, D=None, start_Q=1, max_P=2, max_D=1, max_Q=2, max_order=5, m=1, seasonal=True, stationary=False, information_criterion='aic', alpha=0.05, test='kpss', seasonal_test='ocsb', stepwise=True, n_jobs=1, start_params=None, trend=None, method='lbfgs', maxiter=50, offset_test_args=None, seasonal_test_args=None, suppress_warnings=False, error_action='trace', trace=False, random=False, random_state=None, n_fits=10, return_valid_fits=False, out_of_sample_size=0, scoring='mse', scoring_args=None, with_intercept=True, sarimax_kwargs=None, **fit_args)
    Automatically discover the optimal order for an ARIMA model.
    
    The auto-ARIMA process seeks to identify the most optimal
    parameters for an ``ARIMA`` model, settling on a single fitted ARIMA model.
    This process is based on the commonly-used R function,
    ``forecast::auto.arima`` [3].
    
    Auto-ARIMA works by conducting differencing tests (i.e.,
    Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller or
    Phillips–Perron) to determine the order of differencing, ``d``, and then
    fitting models within ranges of defined ``start_p``, ``max_p``,
    ``start_q``, ``max_q`` ranges. If the ``seasonal`` optional is enabled,
    auto-ARIMA also seeks to identify the optimal ``P`` and ``Q`` hyper-
    parameters after conducting the Canova-Hansen to determine the optimal
    order of seasonal differencing, ``D``.
    
    In order to find the best model, auto-ARIMA optimizes for a given
    ``information_criterion``, one of ('aic', 'aicc', 'bic', 'hqic', 'oob')
    (Akaike Information Criterion, Corrected Akaike Information Criterion,
    Bayesian Information Criterion, Hannan-Quinn Information Criterion, or
    "out of bag"--for validation scoring--respectively) and returns the ARIMA
    which minimizes the value.
    
    Note that due to stationarity issues, auto-ARIMA might not find a
    suitable model that will converge. If this is the case, a ``ValueError``
    will be thrown suggesting stationarity-inducing measures be taken prior
    to re-fitting or that a new range of ``order`` values be selected. Non-
    stepwise (i.e., essentially a grid search) selection can be slow,
    especially for seasonal data. Stepwise algorithm is outlined in Hyndman and
    Khandakar (2008).
    
    Parameters
    ----------
    y : array-like or iterable, shape=(n_samples,)
        The time-series to which to fit the ``ARIMA`` estimator. This may
        either be a Pandas ``Series`` object (statsmodels can internally
        use the dates in the index), or a numpy array. This should be a
        one-dimensional array of floats, and should not contain any
        ``np.nan`` or ``np.inf`` values.
    
    exogenous : array-like, shape=[n_obs, n_vars], optional (default=None)
        An optional 2-d array of exogenous variables. If provided, these
        variables are used as additional features in the regression
        operation. This should not include a constant or trend. Note that
        if an ``ARIMA`` is fit on exogenous features, it must be provided
        exogenous features for making predictions.
    
    start_p : int, optional (default=2)
        The starting value of ``p``, the order (or number of time lags)
        of the auto-regressive ("AR") model. Must be a positive integer.
    
    d : int, optional (default=None)
        The order of first-differencing. If None (by default), the value
        will automatically be selected based on the results of the ``test``
        (i.e., either the Kwiatkowski–Phillips–Schmidt–Shin, Augmented
        Dickey-Fuller or the Phillips–Perron test will be conducted to find
        the most probable value). Must be a positive integer or None. Note
        that if ``d`` is None, the runtime could be significantly longer.
    
    start_q : int, optional (default=2)
        The starting value of ``q``, the order of the moving-average
        ("MA") model. Must be a positive integer.
    
    max_p : int, optional (default=5)
        The maximum value of ``p``, inclusive. Must be a positive integer
        greater than or equal to ``start_p``.
    
    max_d : int, optional (default=2)
        The maximum value of ``d``, or the maximum number of non-seasonal
        differences. Must be a positive integer greater than or equal to ``d``.
    
    max_q : int, optional (default=5)
        The maximum value of ``q``, inclusive. Must be a positive integer
        greater than ``start_q``.
    
    start_P : int, optional (default=1)
        The starting value of ``P``, the order of the auto-regressive portion
        of the seasonal model.
    
    D : int, optional (default=None)
        The order of the seasonal differencing. If None (by default, the value
        will automatically be selected based on the results of the
        ``seasonal_test``. Must be a positive integer or None.
    
    start_Q : int, optional (default=1)
        The starting value of ``Q``, the order of the moving-average portion
        of the seasonal model.
    
    max_P : int, optional (default=2)
        The maximum value of ``P``, inclusive. Must be a positive integer
        greater than ``start_P``.
    
    max_D : int, optional (default=1)
        The maximum value of ``D``. Must be a positive integer greater
        than ``D``.
    
    max_Q : int, optional (default=2)
        The maximum value of ``Q``, inclusive. Must be a positive integer
        greater than ``start_Q``.
    
    max_order : int, optional (default=5)
        Maximum value of p+q+P+Q if model selection is not stepwise.
        If the sum of ``p`` and ``q`` is >= ``max_order``, a model will
        *not* be fit with those parameters, but will progress to the next
        combination. Default is 5. If ``max_order`` is None, it means there
        are no constraints on maximum order.
    
    m : int, optional (default=1)
        The period for seasonal differencing, ``m`` refers to the number of
        periods in each season. For example, ``m`` is 4 for quarterly data, 12
        for monthly data, or 1 for annual (non-seasonal) data. Default is 1.
        Note that if ``m`` == 1 (i.e., is non-seasonal), ``seasonal`` will be
        set to False. For more information on setting this parameter, see
        :ref:`period`.
    
    seasonal : bool, optional (default=True)
        Whether to fit a seasonal ARIMA. Default is True. Note that if
        ``seasonal`` is True and ``m`` == 1, ``seasonal`` will be set to False.
    
    stationary : bool, optional (default=False)
        Whether the time-series is stationary and ``d`` should be set to zero.
    
    information_criterion : str, optional (default='aic')
        The information criterion used to select the best ARIMA model. One of
        ``pmdarima.arima.auto_arima.VALID_CRITERIA``, ('aic', 'bic', 'hqic',
        'oob').
    
    alpha : float, optional (default=0.05)
        Level of the test for testing significance.
    
    test : str, optional (default='kpss')
        Type of unit root test to use in order to detect stationarity if
        ``stationary`` is False and ``d`` is None. Default is 'kpss'
        (Kwiatkowski–Phillips–Schmidt–Shin).
    
    seasonal_test : str, optional (default='ocsb')
        This determines which seasonal unit root test is used if ``seasonal``
        is True and ``D`` is None. Default is 'OCSB'.
    
    stepwise : bool, optional (default=True)
        Whether to use the stepwise algorithm outlined in Hyndman and Khandakar
        (2008) to identify the optimal model parameters. The stepwise algorithm
        can be significantly faster than fitting all (or a ``random`` subset
        of) hyper-parameter combinations and is less likely to over-fit
        the model.
    
    n_jobs : int, optional (default=1)
        The number of models to fit in parallel in the case of a grid search
        (``stepwise=False``). Default is 1, but -1 can be used to designate
        "as many as possible".
    
    start_params : array-like, optional (default=None)
        Starting parameters for ``ARMA(p,q)``.  If None, the default is given
        by ``ARMA._fit_start_params``.
    
    method : str, optional (default='lbfgs')
        The ``method`` determines which solver from ``scipy.optimize``
        is used, and it can be chosen from among the following strings:
    
        - 'newton' for Newton-Raphson
        - 'nm' for Nelder-Mead
        - 'bfgs' for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
        - 'lbfgs' for limited-memory BFGS with optional box constraints
        - 'powell' for modified Powell's method
        - 'cg' for conjugate gradient
        - 'ncg' for Newton-conjugate gradient
        - 'basinhopping' for global basin-hopping solver
    
        The explicit arguments in ``fit`` are passed to the solver,
        with the exception of the basin-hopping solver. Each
        solver has several optional arguments that are not the same across
        solvers. These can be passed as **fit_kwargs
    
    trend : str or None, optional (default=None)
        The trend parameter. If ``with_intercept`` is True, ``trend`` will be
        used. If ``with_intercept`` is False, the trend will be set to a no-
        intercept value.
    
    maxiter : int, optional (default=50)
        The maximum number of function evaluations. Default is 50.
    
    offset_test_args : dict, optional (default=None)
        The args to pass to the constructor of the offset (``d``) test. See
        ``pmdarima.arima.stationarity`` for more details.
    
    seasonal_test_args : dict, optional (default=None)
        The args to pass to the constructor of the seasonal offset (``D``)
        test. See ``pmdarima.arima.seasonality`` for more details.
    
    suppress_warnings : bool, optional (default=False)
        Many warnings might be thrown inside of statsmodels. If
        ``suppress_warnings`` is True, all of the warnings coming from
        ``ARIMA`` will be squelched.
    
    error_action : str, optional (default='warn')
        If unable to fit an ``ARIMA`` due to stationarity issues, whether to
        warn ('warn'), raise the ``ValueError`` ('raise') or ignore ('ignore').
        Note that the default behavior is to warn, and fits that fail will be
        returned as None. This is the recommended behavior, as statsmodels
        ARIMA and SARIMAX models hit bugs periodically that can cause
        an otherwise healthy parameter combination to fail for reasons not
        related to pmdarima.
    
    trace : bool or int, optional (default=False)
        Whether to print status on the fits. A value of False will print no
        debugging information. A value of True will print some. Integer values
        exceeding 1 will print increasing amounts of debug information at each
        fit.
    
    random : bool, optional (default=False)
        Similar to grid searches, ``auto_arima`` provides the capability to
        perform a "random search" over a hyper-parameter space. If ``random``
        is True, rather than perform an exhaustive search or ``stepwise``
        search, only ``n_fits`` ARIMA models will be fit (``stepwise`` must be
        False for this option to do anything).
    
    random_state : int, long or numpy ``RandomState``, optional (default=None)
        The PRNG for when ``random=True``. Ensures replicable testing and
        results.
    
    n_fits : int, optional (default=10)
        If ``random`` is True and a "random search" is going to be performed,
        ``n_iter`` is the number of ARIMA models to be fit.
    
    return_valid_fits : bool, optional (default=False)
        If True, will return all valid ARIMA fits in a list. If False (by
        default), will only return the best fit.
    
    out_of_sample_size : int, optional (default=0)
        The ``ARIMA`` class can fit only a portion of the data if specified,
        in order to retain an "out of bag" sample score. This is the
        number of examples from the tail of the time series to hold out
        and use as validation examples. The model will not be fit on these
        samples, but the observations will be added into the model's ``endog``
        and ``exog`` arrays so that future forecast values originate from the
        end of the endogenous vector.
    
        For instance::
    
            y = [0, 1, 2, 3, 4, 5, 6]
            out_of_sample_size = 2
    
            > Fit on: [0, 1, 2, 3, 4]
            > Score on: [5, 6]
            > Append [5, 6] to end of self.arima_res_.data.endog values
    
    scoring : str, optional (default='mse')
        If performing validation (i.e., if ``out_of_sample_size`` > 0), the
        metric to use for scoring the out-of-sample data. One of ('mse', 'mae')
    
    scoring_args : dict, optional (default=None)
        A dictionary of key-word arguments to be passed to the ``scoring``
        metric.
    
    with_intercept : bool, optional (default=True)
        Whether to include an intercept term. Default is True.
    
    
    sarimax_kwargs : dict or None, optional (default=None)
        Keyword arguments to pass to the ARIMA constructor.
    
    
    **fit_args : dict, optional (default=None)
        A dictionary of keyword arguments to pass to the :func:`ARIMA.fit`
        method.
    
    See Also
    --------
    :func:`pmdarima.arima.ARIMA`
    
    Notes
    -----
    * Fitting with `stepwise=False` can prove slower, especially when
      `seasonal=True`.
    
    References
    ----------
    .. [1] https://wikipedia.org/wiki/Autoregressive_integrated_moving_average
    .. [2] R's auto-arima source code: https://github.com/robjhyndman/forecast/blob/master/R/arima.R
    .. [3] R's auto-arima documentation: https://www.rdocumentation.org/packages/forecast

Auto_arima, based on the AIC, will give you the mest combination of p,d,q parameters

In [8]:
auto_arima(df2['Births'])
Out[8]:
ARIMA(maxiter=50, method='lbfgs', order=(1, 1, 1), out_of_sample_size=0,
      scoring='mse', scoring_args=None, seasonal_order=(0, 0, 0, 0),
      start_params=None, suppress_warnings=False, trend=None,
      with_intercept=True)

We specify we want the model to try from 0 to 6 p and from 0 to 3

In [11]:
stepwise_fit = auto_arima(df2['Births'], start_p=0, start_q=0, max_p=6, max_q=3, seasonal=False, trace=True)
Performing stepwise search to minimize aic
Fit ARIMA(0,1,0)x(0,0,0,0) [intercept=True]; AIC=2650.760, BIC=2658.555, Time=0.026 seconds
Fit ARIMA(1,1,0)x(0,0,0,0) [intercept=True]; AIC=2565.234, BIC=2576.925, Time=0.060 seconds
Fit ARIMA(0,1,1)x(0,0,0,0) [intercept=True]; AIC=2463.584, BIC=2475.275, Time=0.128 seconds
Fit ARIMA(0,1,0)x(0,0,0,0) [intercept=False]; AIC=2648.768, BIC=2652.665, Time=0.015 seconds
Fit ARIMA(1,1,1)x(0,0,0,0) [intercept=True]; AIC=2460.154, BIC=2475.743, Time=0.217 seconds
Fit ARIMA(2,1,1)x(0,0,0,0) [intercept=True]; AIC=2461.271, BIC=2480.757, Time=0.312 seconds
Fit ARIMA(1,1,2)x(0,0,0,0) [intercept=True]; AIC=2460.810, BIC=2480.296, Time=1.050 seconds
Near non-invertible roots for order (1, 1, 2)(0, 0, 0, 0); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.996)
Fit ARIMA(0,1,2)x(0,0,0,0) [intercept=True]; AIC=2460.722, BIC=2476.311, Time=0.445 seconds
Fit ARIMA(2,1,0)x(0,0,0,0) [intercept=True]; AIC=2536.154, BIC=2551.742, Time=0.182 seconds
Fit ARIMA(2,1,2)x(0,0,0,0) [intercept=True]; AIC=2463.066, BIC=2486.449, Time=1.325 seconds
Total fit time: 3.774 seconds

For our data, the best p,d,q parameters found are 1,1,1. Note that auto_arima did not try all of the parameters, since he/she realised that the AIC was actually not improving,so it stopped trying more combinations

This shows a recommended (p,d,q) ARIMA Order of (1,1,1), with no seasonal_order component.

We can see how this was determined by looking at the stepwise results. The recommended order is the one with the lowest Akaike information criterion or AIC score. Note that the recommended model may not be the one with the closest fit. The AIC score takes complexity into account, and tries to identify the best forecasting model.

Let's have a look now at the description of the winner parameter-combination

NOTE: Harmless warnings should have been suppressed, but if you see an error citing unusual behavior you can suppress this message by passing error_action='ignore' into auto_arima(). Also, auto_arima().summary() provides a nicely formatted summary table.
In [12]:
stepwise_fit.summary()
Out[12]:
SARIMAX Results
Dep. Variable: y No. Observations: 365
Model: SARIMAX(1, 1, 1) Log Likelihood -1226.077
Date: Sun, 03 May 2020 AIC 2460.154
Time: 09:36:11 BIC 2475.743
Sample: 0 HQIC 2466.350
- 365
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 0.0132 0.014 0.975 0.330 -0.013 0.040
ar.L1 0.1299 0.059 2.217 0.027 0.015 0.245
ma.L1 -0.9694 0.016 -62.235 0.000 -1.000 -0.939
sigma2 48.9989 3.432 14.279 0.000 42.273 55.725
Ljung-Box (Q): 36.69 Jarque-Bera (JB): 26.17
Prob(Q): 0.62 Prob(JB): 0.00
Heteroskedasticity (H): 0.97 Skew: 0.58
Prob(H) (two-sided): 0.85 Kurtosis: 3.62


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Now let's look at the non-stationary, seasonal Airline Passengers dataset: Note that since we set seasonal=True, the model is also runing a SARIMA model

In [13]:
# With Trace=True we can see the trace results of the parameter combination that the model is using
# m is the type of differencing, so if we difference on quarterly daya, we will set m=4, for monthly m=12, yearly m=1
stepwise_fit = auto_arima(df1['Thousands of Passengers'], start_p=1, start_q=1,
                          max_p=3, max_q=3, m=12,
                          start_P=0, seasonal=True,
                          d=None, D=1, trace=True,
                          error_action='ignore',   # we don't want to know if an order does not work
                          suppress_warnings=True,  # we don't want convergence warnings
                          stepwise=True)           # set to stepwise

stepwise_fit.summary()
Performing stepwise search to minimize aic
Fit ARIMA(1,1,1)x(0,1,1,12) [intercept=True]; AIC=1024.824, BIC=1039.200, Time=1.220 seconds
Fit ARIMA(0,1,0)x(0,1,0,12) [intercept=True]; AIC=1033.479, BIC=1039.229, Time=0.045 seconds
Fit ARIMA(1,1,0)x(1,1,0,12) [intercept=True]; AIC=1022.316, BIC=1033.817, Time=0.839 seconds
Fit ARIMA(0,1,1)x(0,1,1,12) [intercept=True]; AIC=1022.904, BIC=1034.405, Time=0.978 seconds
Fit ARIMA(0,1,0)x(0,1,0,12) [intercept=False]; AIC=1031.508, BIC=1034.383, Time=0.045 seconds
Fit ARIMA(1,1,0)x(0,1,0,12) [intercept=True]; AIC=1022.343, BIC=1030.968, Time=0.137 seconds
Fit ARIMA(1,1,0)x(2,1,0,12) [intercept=True]; AIC=1021.137, BIC=1035.513, Time=2.400 seconds
Fit ARIMA(1,1,0)x(2,1,1,12) [intercept=True]; AIC=1017.167, BIC=1034.418, Time=6.912 seconds
Near non-invertible roots for order (1, 1, 0)(2, 1, 1, 12); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.998)
Fit ARIMA(1,1,0)x(1,1,1,12) [intercept=True]; AIC=1022.410, BIC=1036.786, Time=1.493 seconds
Fit ARIMA(0,1,0)x(2,1,0,12) [intercept=True]; AIC=1034.067, BIC=1045.568, Time=1.637 seconds
Fit ARIMA(2,1,0)x(2,1,0,12) [intercept=True]; AIC=1023.007, BIC=1040.258, Time=5.339 seconds
Fit ARIMA(1,1,1)x(2,1,0,12) [intercept=True]; AIC=1022.905, BIC=1040.156, Time=4.271 seconds
Fit ARIMA(0,1,1)x(2,1,0,12) [intercept=True]; AIC=1021.017, BIC=1035.393, Time=2.567 seconds
Fit ARIMA(0,1,1)x(1,1,0,12) [intercept=True]; AIC=1022.314, BIC=1033.815, Time=0.985 seconds
Fit ARIMA(0,1,1)x(2,1,1,12) [intercept=True]; AIC=1015.841, BIC=1033.092, Time=9.209 seconds
Near non-invertible roots for order (0, 1, 1)(2, 1, 1, 12); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.998)
Fit ARIMA(0,1,1)x(1,1,1,12) [intercept=True]; AIC=1022.207, BIC=1036.583, Time=1.942 seconds
Fit ARIMA(0,1,2)x(2,1,0,12) [intercept=True]; AIC=1022.996, BIC=1040.247, Time=2.853 seconds
Fit ARIMA(1,1,2)x(2,1,0,12) [intercept=True]; AIC=1024.668, BIC=1044.795, Time=5.043 seconds
Total fit time: 47.950 seconds
Out[13]:
SARIMAX Results
Dep. Variable: y No. Observations: 144
Model: SARIMAX(0, 1, 1)x(2, 1, 1, 12) Log Likelihood -501.920
Date: Sun, 03 May 2020 AIC 1015.841
Time: 09:42:20 BIC 1033.092
Sample: 0 HQIC 1022.851
- 144
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 0.0003 0.033 0.010 0.992 -0.064 0.064
ma.L1 -0.4237 0.069 -6.177 0.000 -0.558 -0.289
ar.S.L12 0.6676 0.161 4.149 0.000 0.352 0.983
ar.S.L24 0.3310 0.095 3.480 0.001 0.145 0.517
ma.S.L12 -0.9748 1.252 -0.779 0.436 -3.428 1.478
sigma2 110.8423 115.690 0.958 0.338 -115.906 337.591
Ljung-Box (Q): 53.05 Jarque-Bera (JB): 7.56
Prob(Q): 0.08 Prob(JB): 0.02
Heteroskedasticity (H): 2.82 Skew: 0.10
Prob(H) (two-sided): 0.00 Kurtosis: 4.16


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

The winner is a SARIMA (0,1,1)x(2,1,12) (p,d,q)x(P,D,Q,M). We will talk about SARIMA later

In the table you can see the coefficients for each of the AR & MA lags

ARMA(p,q) and ARIMA(p,d,q)

Autoregressive Moving Averages

This section covers Autoregressive Moving Averages (ARMA) and Autoregressive Integrated Moving Averages (ARIMA).

Recall that an AR(1) model follows the formula

    $y_{t} = c + \phi_{1}y_{t-1} + \varepsilon_{t}$

while an MA(1) model follows the formula

    $y_{t} = \mu + \theta_{1}\varepsilon_{t-1} + \varepsilon_{t}$

where $c$ is a constant, $\mu$ is the expectation of $y_{t}$ (often assumed to be zero), $\phi_1$ (phi-sub-one) is the AR lag coefficient, $\theta_1$ (theta-sub-one) is the MA lag coefficient, and $\varepsilon$ (epsilon) is white noise.

An ARMA(1,1) model therefore follows

    $y_{t} = c + \phi_{1}y_{t-1} + \theta_{1}\varepsilon_{t-1} + \varepsilon_{t}$

ARMA models can be used on stationary datasets.

For non-stationary datasets with a trend component, ARIMA models apply a differencing coefficient as well.

Related Functions:

arima_model.ARMA(endog, order[, exog, …])     Autoregressive Moving Average ARMA(p,q) model
arima_model.ARMAResults(model, params[, …])   Class to hold results from fitting an ARMA model
arima_model.ARIMA(endog, order[, exog, …])    Autoregressive Integrated Moving Average ARIMA(p,d,q) model
arima_model.ARIMAResults(model, params[, …])  Class to hold results from fitting an ARIMA model
kalmanf.kalmanfilter.KalmanFilter             Kalman Filter code intended for use with the ARMA model

For Further Reading:

Wikipedia  Autoregressive–moving-average model
Forecasting: Principles and Practice  Non-seasonal ARIMA models

Autoregressive Moving Average - ARMA(p,q)

In this first section we'll look at a stationary dataset, determine (p,q) orders, and run a forecasting ARMA model fit to the data. In practice it's rare to find stationary data with no trend or seasonal component, but the first four months of the Daily Total Female Births dataset should work for our purposes.

Plot the source data

In [16]:
# Load specific forecasting tools
from statsmodels.tsa.arima_model import ARMA,ARMAResults,ARIMA,ARIMAResults
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from pmdarima import auto_arima # for determining ARIMA orders

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

# Load datasets
df1 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/DailyTotalFemaleBirths.csv',index_col='Date',parse_dates=True)
df1.index.freq = 'D'
df1 = df1[:120]  # we only want the first four months

df2 = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/TradeInventories.csv',index_col='Date',parse_dates=True)
df2.index.freq='MS'
In [18]:
# this function was created previously (control f adf_test)
adf_test(df1["Births"])
Augmented Dickey-Fuller Test: 
ADF test statistic     -9.855384e+00
p-value                 4.373545e-17
# lags used             0.000000e+00
# observations          1.190000e+02
critical value (1%)    -3.486535e+00
critical value (5%)    -2.886151e+00
critical value (10%)   -2.579896e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Determine the (p,q) ARMA Orders using pmdarima.auto_arima

This tool should give just $p$ and $q$ value recommendations for this dataset.

In [20]:
auto_arima(df1["Births"],seasonal=False).summary()
Out[20]:
SARIMAX Results
Dep. Variable: y No. Observations: 120
Model: SARIMAX Log Likelihood -409.745
Date: Sun, 03 May 2020 AIC 823.489
Time: 10:20:24 BIC 829.064
Sample: 0 HQIC 825.753
- 120
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 39.7833 0.687 57.896 0.000 38.437 41.130
sigma2 54.1197 8.319 6.506 0.000 37.815 70.424
Ljung-Box (Q): 44.41 Jarque-Bera (JB): 2.69
Prob(Q): 0.29 Prob(JB): 0.26
Heteroskedasticity (H): 0.80 Skew: 0.26
Prob(H) (two-sided): 0.48 Kurtosis: 2.48


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Split the data into train/test sets

As a general rule you should set the length of your test set equal to your intended forecast size. For this dataset we'll attempt a 1-month forecast.

In [21]:
# Set one month for testing
train = df1.iloc[:90]
test = df1.iloc[90:]

Fit an ARMA(p,q) Model

If you want you can run help(ARMA) to learn what incoming arguments are available/expected, and what's being returned.

In [22]:
model = ARMA(train['Births'],order=(2,2))
results = model.fit()
results.summary()
Out[22]:
ARMA Model Results
Dep. Variable: Births No. Observations: 90
Model: ARMA(2, 2) Log Likelihood -307.905
Method: css-mle S.D. of innovations 7.405
Date: Sun, 03 May 2020 AIC 627.809
Time: 10:21:56 BIC 642.808
Sample: 01-01-1959 HQIC 633.858
- 03-31-1959
coef std err z P>|z| [0.025 0.975]
const 39.7549 0.912 43.607 0.000 37.968 41.542
ar.L1.Births -0.1850 1.087 -0.170 0.865 -2.315 1.945
ar.L2.Births 0.4352 0.644 0.675 0.500 -0.828 1.698
ma.L1.Births 0.2777 1.097 0.253 0.800 -1.872 2.427
ma.L2.Births -0.3999 0.679 -0.589 0.556 -1.730 0.930
Roots
Real Imaginary Modulus Frequency
AR.1 -1.3181 +0.0000j 1.3181 0.5000
AR.2 1.7434 +0.0000j 1.7434 0.0000
MA.1 -1.2718 +0.0000j 1.2718 0.5000
MA.2 1.9662 +0.0000j 1.9662 0.0000

Obtain a month's worth of predicted values

In [23]:
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end).rename('ARMA(2,2) Predictions')

Plot predictions against known values

In [24]:
title = 'Daily Total Female Births'
ylabel='Births'
xlabel='' # we don't really need a label here

ax = test['Births'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

The orange line is the predicted values, which even though it looks like it is a very bad prediction, it makes completely sense since ARMA shows the average value. The model is not able to predict the noise, but is able to predict the average value.

In fact if we calculate the average value of test and predictions, they have the same average value:

In [25]:
test.mean()
Out[25]:
Births    39.833333
dtype: float64
In [26]:
predictions.mean()
Out[26]:
39.7774288668336

Now let's work on the ARIMA mode,so we will ad the I component into the ARMA model:


Autoregressive Integrated Moving Average - ARIMA(p,d,q)

The steps are the same as for ARMA(p,q), except that we'll apply a differencing component to make the dataset stationary.
First let's take a look at the Real Manufacturing and Trade Inventories dataset.

Plot the Source Data

In [28]:
# HERE'S A TRICK TO ADD COMMAS TO Y-AXIS TICK VALUES
import matplotlib.ticker as ticker

formatter = ticker.StrMethodFormatter('{x:,.0f}')

title = 'Real Manufacturing and Trade Inventories'
ylabel='Chained 2012 Dollars'
xlabel='' # we don't really need a label here

ax = df2['Inventories'].plot(figsize=(12,5),title=title)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
ax.yaxis.set_major_formatter(formatter);

Run an ETS Decomposition

We probably won't learn a lot from it, but it never hurts to run an ETS Decomposition plot.

In [29]:
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(df2['Inventories'], model='additive')  # model='add' also works
result.plot();

Here we see that the seasonal component does not contribute significantly to the behavior of the series.

Use pmdarima.auto_arima to determine ARIMA Orders

So for the purpose of this code, we will ignore the seasonal component (otherwise we should use SARIMA, instead of ARIMA)

In [30]:
auto_arima(df2['Inventories'],seasonal=False).summary()
Out[30]:
SARIMAX Results
Dep. Variable: y No. Observations: 264
Model: SARIMAX(0, 1, 0) Log Likelihood -2672.018
Date: Sun, 03 May 2020 AIC 5348.037
Time: 10:33:03 BIC 5355.181
Sample: 0 HQIC 5350.908
- 264
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 3258.3802 470.991 6.918 0.000 2335.255 4181.506
sigma2 3.91e+07 2.95e+06 13.250 0.000 3.33e+07 4.49e+07
Ljung-Box (Q): 455.75 Jarque-Bera (JB): 100.74
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.86 Skew: -1.15
Prob(H) (two-sided): 0.48 Kurtosis: 4.98


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

This suggests that we should fit an SARIMAX(0,1,0) model to best forecast future values of the series. Before we train the model, let's look at augmented Dickey-Fuller Test, and the ACF/PACF plots to see if they agree. These steps are optional, and we would likely skip them in practice.

In [31]:
from statsmodels.tsa.statespace.tools import diff
df2['d1'] = diff(df2['Inventories'],k_diff=1)

# Equivalent to:
# df1['d1'] = df1['Inventories'] - df1['Inventories'].shift(1)

adf_test(df2['d1'],'Real Manufacturing and Trade Inventories')
Augmented Dickey-Fuller Test: Real Manufacturing and Trade Inventories
ADF test statistic       -3.412249
p-value                   0.010548
# lags used               4.000000
# observations          258.000000
critical value (1%)      -3.455953
critical value (5%)      -2.872809
critical value (10%)     -2.572775
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

This confirms that we reached stationarity after the first difference.

Run the ACF and PACF plots

A PACF Plot can reveal recommended AR(p) orders, and an ACF Plot can do the same for MA(q) orders.
Alternatively, we can compare the stepwise Akaike Information Criterion (AIC) values across a set of different (p,q) combinations to choose the best combination.

In [32]:
title = 'Autocorrelation: Real Manufacturing and Trade Inventories'
lags = 40
plot_acf(df2['Inventories'],title=title,lags=lags);
In [33]:
title = 'Partial Autocorrelation: Real Manufacturing and Trade Inventories'
lags = 40
plot_pacf(df2['Inventories'],title=title,lags=lags);

This tells us that the AR component should be more important than MA. From the Duke University Statistical Forecasting site:

If the PACF displays a sharp cutoff while the ACF decays more slowly (i.e., has significant spikes at higher lags), we say that the stationarized series displays an "AR signature," meaning that the autocorrelation pattern can be explained more easily by adding AR terms than by adding MA terms.

Let's take a look at pmdarima.auto_arima done stepwise to see if having $p$ and $q$ terms the same still makes sense:

In [34]:
stepwise_fit = auto_arima(df2['Inventories'], start_p=0, start_q=0,
                          max_p=2, max_q=2, m=12,
                          seasonal=False,
                          d=None, trace=True,
                          error_action='ignore',   # we don't want to know if an order does not work
                          suppress_warnings=True,  # we don't want convergence warnings
                          stepwise=True)           # set to stepwise

stepwise_fit.summary()
Performing stepwise search to minimize aic
Fit ARIMA(0,1,0)x(0,0,0,0) [intercept=True]; AIC=5348.037, BIC=5355.181, Time=0.043 seconds
Fit ARIMA(1,1,0)x(0,0,0,0) [intercept=True]; AIC=5399.843, BIC=5410.560, Time=0.106 seconds
Fit ARIMA(0,1,1)x(0,0,0,0) [intercept=True]; AIC=5350.241, BIC=5360.957, Time=0.098 seconds
Fit ARIMA(0,1,0)x(0,0,0,0) [intercept=False]; AIC=5409.217, BIC=5412.789, Time=0.032 seconds
Fit ARIMA(1,1,1)x(0,0,0,0) [intercept=True]; AIC=5378.835, BIC=5393.124, Time=0.503 seconds
Total fit time: 0.790 seconds
Out[34]:
SARIMAX Results
Dep. Variable: y No. Observations: 264
Model: SARIMAX(0, 1, 0) Log Likelihood -2672.018
Date: Sun, 03 May 2020 AIC 5348.037
Time: 10:39:27 BIC 5355.181
Sample: 0 HQIC 5350.908
- 264
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 3258.3802 470.991 6.918 0.000 2335.255 4181.506
sigma2 3.91e+07 2.95e+06 13.250 0.000 3.33e+07 4.49e+07
Ljung-Box (Q): 455.75 Jarque-Bera (JB): 100.74
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.86 Skew: -1.15
Prob(H) (two-sided): 0.48 Kurtosis: 4.98


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Split the data into train/test sets

In [35]:
len(df2)
Out[35]:
264
In [36]:
# Set one year for testing
train = df2.iloc[:252]
test = df2.iloc[252:]

Fit an ARIMA(1,1,1) Model

In [38]:
model = ARIMA(train['Inventories'],order=(1,1,1))
results = model.fit()
results.summary()
Out[38]:
ARIMA Model Results
Dep. Variable: D.Inventories No. Observations: 251
Model: ARIMA(1, 1, 1) Log Likelihood -2486.394
Method: css-mle S.D. of innovations 4845.007
Date: Sun, 03 May 2020 AIC 4980.788
Time: 10:57:38 BIC 4994.890
Sample: 02-01-1997 HQIC 4986.463
- 12-01-2017
coef std err z P>|z| [0.025 0.975]
const 3235.7979 1344.940 2.406 0.016 599.764 5871.832
ar.L1.D.Inventories 0.9026 0.039 23.012 0.000 0.826 0.979
ma.L1.D.Inventories -0.5581 0.079 -7.048 0.000 -0.713 -0.403
Roots
Real Imaginary Modulus Frequency
AR.1 1.1080 +0.0000j 1.1080 0.0000
MA.1 1.7917 +0.0000j 1.7917 0.0000
In [39]:
# Obtain predicted values
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels').rename('ARIMA(1,1,1) Predictions')

Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values).

Passing typ='levels' predicts the levels of the original endogenous variables. If we'd used the default typ='linear' we would have seen linear predictions in terms of the differenced endogenous variables.

For more information on these arguments visit https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMAResults.predict.html

In [41]:
# Compare predictions to expected values
for i in range(len(predictions)):
    print(f"predicted={predictions[i]:<11.10}, expected={test['Inventories'][i]}")
predicted=2107156.741, expected=2110158
predicted=2110545.923, expected=2118199
predicted=2113920.16 , expected=2112427
predicted=2117280.909, expected=2112276
predicted=2120629.483, expected=2111835
predicted=2123967.068, expected=2109298
predicted=2127294.736, expected=2119618
predicted=2130613.453, expected=2127170
predicted=2133924.091, expected=2134172
predicted=2137227.436, expected=2144639
predicted=2140524.201, expected=2143001
predicted=2143815.024, expected=2158115
In [42]:
# Plot predictions against known values
title = 'Real Manufacturing and Trade Inventories'
ylabel='Chained 2012 Dollars'
xlabel='' # we don't really need a label here

ax = test['Inventories'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
ax.yaxis.set_major_formatter(formatter);

Evaluate the Model

In [43]:
from sklearn.metrics import mean_squared_error

error = mean_squared_error(test['Inventories'], predictions)
print(f'ARIMA(1,1,1) MSE Error: {error:11.10}')
ARIMA(1,1,1) MSE Error: 60315541.59
In [44]:
from statsmodels.tools.eval_measures import rmse

error = rmse(test['Inventories'], predictions)
print(f'ARIMA(1,1,1) RMSE Error: {error:11.10}')
ARIMA(1,1,1) RMSE Error: 7766.308105

Retrain the model on the full data, and forecast the future

In [45]:
model = ARIMA(df2['Inventories'],order=(1,1,1))
results = model.fit()
fcast = results.predict(len(df2),len(df2)+11,typ='levels').rename('ARIMA(1,1,1) Forecast')
In [46]:
# Plot predictions against known values
title = 'Real Manufacturing and Trade Inventories'
ylabel='Chained 2012 Dollars'
xlabel='' # we don't really need a label here

ax = df2['Inventories'].plot(legend=True,figsize=(12,6),title=title)
fcast.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
ax.yaxis.set_major_formatter(formatter);

SARIMA(p,d,q)(P,D,Q)m

Seasonal Autoregressive Integrated Moving Averages

We have finally reached one of the most fascinating aspects of time series analysis: seasonality.

Where ARIMA accepts the parameters $(p,d,q)$, SARIMA accepts an additional set of parameters $(P,D,Q)m$ that specifically describe the seasonal components of the model. Here $P$, $D$ and $Q$ represent the seasonal regression, differencing and moving average coefficients, and $m$ represents the number of data points (rows) in each seasonal cycle.

NOTE: The statsmodels implementation of SARIMA is called SARIMAX. The “X” added to the name means that the function also supports exogenous regressor variables. We'll cover these in the next section.

Related Functions:

sarimax.SARIMAX(endog[, exog, order, …])       
sarimax.SARIMAXResults(model, params, …[, …])  Class to hold results from fitting a SARIMAX model.

For Further Reading:

Statsmodels Tutorial:  Time Series Analysis by State Space Methods

Perform standard imports and load datasets

In [47]:
# Load specific forecasting tools
from statsmodels.tsa.statespace.sarimax import SARIMAX

from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from statsmodels.tsa.seasonal import seasonal_decompose      # for ETS Plots
from pmdarima import auto_arima                              # for determining ARIMA orders

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

# Load dataset
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/co2_mm_mlo.csv')
In [ ]:
df.head()

We need to combine two integer columns (year and month) into a DatetimeIndex. We can do this by passing a dictionary into pandas.to_datetime() with year, month and day values.
For more information visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

In [48]:
# Add a "date" datetime column
df['date']=pd.to_datetime(dict(year=df['year'], month=df['month'], day=1))

#another way to merge dates is  pd.to_datetime({'year':df['year'],'month':df['month'],'day':1})


# Set "date" to be the index
df.set_index('date',inplace=True)
df.index.freq = 'MS'
df.head()
Out[48]:
year month decimal_date average interpolated
date
1958-03-01 1958 3 1958.208 315.71 315.71
1958-04-01 1958 4 1958.292 317.45 317.45
1958-05-01 1958 5 1958.375 317.50 317.50
1958-06-01 1958 6 1958.458 NaN 317.10
1958-07-01 1958 7 1958.542 315.86 315.86
In [49]:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 729 entries, 1958-03-01 to 2018-11-01
Freq: MS
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   year          729 non-null    int64  
 1   month         729 non-null    int64  
 2   decimal_date  729 non-null    float64
 3   average       722 non-null    float64
 4   interpolated  729 non-null    float64
dtypes: float64(3), int64(2)
memory usage: 34.2 KB

Plot the source data

In [50]:
title = 'Monthly Mean CO₂ Levels (ppm) over Mauna Loa, Hawaii'
ylabel='parts per million'
xlabel='' # we don't really need a label here

ax = df['interpolated'].plot(figsize=(12,6),title=title)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

Run an ETS Decomposition

In [51]:
result = seasonal_decompose(df['interpolated'], model='add')
result.plot();

Although small in scale compared to the overall values, there is a definite annual seasonality.

This may take awhile as there are a lot more combinations to evaluate.

In [52]:
# For SARIMA Orders we set seasonal=True and pass in an m value
# since seasonality is yearl and we have monthly data then m=12
auto_arima(df['interpolated'],seasonal=True,m=12).summary()
Out[52]:
SARIMAX Results
Dep. Variable: y No. Observations: 729
Model: SARIMAX(0, 1, 2)x(1, 0, [1], 12) Log Likelihood -208.275
Date: Sun, 03 May 2020 AIC 428.550
Time: 13:44:12 BIC 456.091
Sample: 0 HQIC 439.177
- 729
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 0.0001 0.000 1.009 0.313 -0.000 0.000
ma.L1 -0.3562 0.037 -9.719 0.000 -0.428 -0.284
ma.L2 -0.0615 0.033 -1.844 0.065 -0.127 0.004
ar.S.L12 0.9996 0.000 3048.169 0.000 0.999 1.000
ma.S.L12 -0.8654 0.022 -40.083 0.000 -0.908 -0.823
sigma2 0.0963 0.005 20.314 0.000 0.087 0.106
Ljung-Box (Q): 50.10 Jarque-Bera (JB): 4.37
Prob(Q): 0.13 Prob(JB): 0.11
Heteroskedasticity (H): 1.11 Skew: -0.02
Prob(H) (two-sided): 0.40 Kurtosis: 3.38


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Split the data into train/test sets

In [54]:
# Set one year for testing
train = df.iloc[:717]
test = df.iloc[717:]

Fit

In [55]:
model = SARIMAX(train['interpolated'],order=(0,1,3),seasonal_order=(1,0,1,12))
results = model.fit()
results.summary()
Out[55]:
SARIMAX Results
Dep. Variable: interpolated No. Observations: 717
Model: SARIMAX(0, 1, 3)x(1, 0, [1], 12) Log Likelihood -201.190
Date: Sun, 03 May 2020 AIC 414.379
Time: 13:44:44 BIC 441.821
Sample: 03-01-1958 HQIC 424.976
- 11-01-2017
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ma.L1 -0.3541 0.036 -9.824 0.000 -0.425 -0.283
ma.L2 -0.0248 0.035 -0.700 0.484 -0.094 0.045
ma.L3 -0.0869 0.033 -2.606 0.009 -0.152 -0.022
ar.S.L12 0.9997 0.000 3058.958 0.000 0.999 1.000
ma.S.L12 -0.8668 0.023 -38.504 0.000 -0.911 -0.823
sigma2 0.0950 0.005 20.302 0.000 0.086 0.104
Ljung-Box (Q): 43.95 Jarque-Bera (JB): 4.39
Prob(Q): 0.31 Prob(JB): 0.11
Heteroskedasticity (H): 1.15 Skew: 0.02
Prob(H) (two-sided): 0.28 Kurtosis: 3.38


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [56]:
# Obtain predicted values
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA(0,1,3)(1,0,1,12) Predictions')

Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values).

Passing typ='levels' predicts the levels of the original endogenous variables. If we'd used the default typ='linear' we would have seen linear predictions in terms of the differenced endogenous variables.

For more information on these arguments visit https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMAResults.predict.html

In [57]:
# Compare predictions to expected values
for i in range(len(predictions)):
    print(f"predicted={predictions[i]:<11.10}, expected={test['interpolated'][i]}")
predicted=406.6098861, expected=406.81
predicted=407.8248514, expected=407.96
predicted=408.5788433, expected=408.32
predicted=409.4838247, expected=409.41
predicted=411.0393447, expected=410.24
predicted=411.642275 , expected=411.24
predicted=410.8630395, expected=410.79
predicted=409.1735271, expected=408.71
predicted=407.0731977, expected=406.99
predicted=405.6227532, expected=405.51
predicted=405.892521 , expected=406.0
predicted=407.4256078, expected=408.02
In [58]:
# Plot predictions against known values
title = 'Monthly Mean CO₂ Levels (ppm) over Mauna Loa, Hawaii'
ylabel='parts per million'
xlabel=''

ax = test['interpolated'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

Evaluate the Model

In [59]:
from sklearn.metrics import mean_squared_error

error = mean_squared_error(test['interpolated'], predictions)
print(f'SARIMA(0,1,3)(1,0,1,12) MSE Error: {error:11.10}')

from statsmodels.tools.eval_measures import rmse

error = rmse(test['interpolated'], predictions)
print(f'SARIMA(0,1,3)(1,0,1,12) RMSE Error: {error:11.10}')
SARIMA(0,1,3)(1,0,1,12) MSE Error: 0.128018237
SARIMA(0,1,3)(1,0,1,12) RMSE Error: 0.3577963625

Remember that in order to understand and intepret de error you need always to look at the data.

In [60]:
test['interpolated'].mean()
Out[60]:
408.3333333333333

These are outstanding results!, since a RMSE of 0.35 is tiny compared to 408.333

Retrain the model on the full data, and forecast the future

In [61]:
model = SARIMAX(df['interpolated'],order=(0,1,3),seasonal_order=(1,0,1,12))
results = model.fit()
fcast = results.predict(len(df),len(df)+11,typ='levels').rename('SARIMA(0,1,3)(1,0,1,12) Forecast')

# Plot predictions against known values
title = 'Monthly Mean CO₂ Levels (ppm) over Mauna Loa, Hawaii'
ylabel='parts per million'
xlabel=''

ax = df['interpolated'].plot(legend=True,figsize=(12,6),title=title)
fcast.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

SARIMAX

Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors

So far the models we've looked at consider past values of a dataset and past errors to determine future trends, seasonality and forecasted values. We look now to models that encompass these non-seasonal (p,d,q) and seasonal (P,D,Q,m) factors, but introduce the idea that external factors (environmental, economic, etc.) can also influence a time series, and be used in forecasting.

Related Functions:

sarimax.SARIMAX(endog[, exog, order, …])       
sarimax.SARIMAXResults(model, params, …[, …])  Class to hold results from fitting a SARIMAX model.

For Further Reading:

Statsmodels Tutorial:  Time Series Analysis by State Space Methods
Statsmodels Example:  SARIMAX

Perform standard imports and load datasets

In [4]:
import pandas as pd
import numpy as np
%matplotlib inline

# Load specific forecasting tools
from statsmodels.tsa.statespace.sarimax import SARIMAX

from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders
from statsmodels.tsa.seasonal import seasonal_decompose      # for ETS Plots
from pmdarima import auto_arima                              # for determining ARIMA orders

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

# Load dataset
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/RestaurantVisitors.csv',index_col='date',parse_dates=True)
df.index.freq = 'D'

Inspect the data

For this section we've built a Restaurant Visitors dataset that was inspired by a recent Kaggle competition. The data considers daily visitors to four restaurants located in the United States, subject to American holidays. For the exogenous variable we'll see how holidays affect patronage. The dataset contains 478 days of restaurant data, plus an additional 39 days of holiday data for forecasting purposes.

Notice that even though the restaurant visitor columns contain integer data, they appear as floats. This is because the bottom of the dataframe has 39 rows of NaN data to accommodate the extra holiday data we'll use for forecasting, and pandas won't allow NaN's as integers. We could leave it like this, but since we have to drop NaN values anyway, let's also convert the columns to dtype int64.

In [5]:
df1 = df.dropna()
df1.tail()
Out[5]:
weekday holiday holiday_name rest1 rest2 rest3 rest4 total
date
2017-04-18 Tuesday 0 na 30.0 30.0 13.0 18.0 91.0
2017-04-19 Wednesday 0 na 20.0 11.0 30.0 18.0 79.0
2017-04-20 Thursday 0 na 22.0 3.0 19.0 46.0 90.0
2017-04-21 Friday 0 na 38.0 53.0 36.0 38.0 165.0
2017-04-22 Saturday 0 na 97.0 20.0 50.0 59.0 226.0
In [ ]:
# Change the dtype of selected columns from float to int (we cannot isk having 1.5 customers)
cols = ['rest1','rest2','rest3','rest4','total']
for col in cols:
    df1[col] = df1[col].astype(int)
df1.head()

Plot the source data

In [6]:
title='Restaurant Visitors'
ylabel='Visitors per day'
xlabel='' # we don't really need a label here

ax = df1['total'].plot(figsize=(16,5),title=title)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

Look at holidays

Rather than prepare a separate plot, we can use matplotlib to shade holidays behind our restaurant data.

In [8]:
title='Restaurant Visitors'
ylabel='Visitors per day'
xlabel='' # we don't really need a label here

ax = df1['total'].plot(figsize=(16,5),title=title)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
for x in df1.query('holiday==1').index:     # for days where holiday == 1  #you can also do it the classical way: df1[df1['holiday']==1].index
    ax.axvline(x=x, color='k', alpha = 0.3);  # add a semi-transparent grey line

Run an ETS Decomposition

In [9]:
result = seasonal_decompose(df1['total'])
result.plot();
In [12]:
result.seasonal.plot(figsize=(18,5));

Test for stationarity

In [10]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series,title=''):
    """
    Pass in a time series and an optional title, returns an ADF report
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(),autolag='AIC') # .dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    out = pd.Series(result[0:4],index=labels)

    for key,val in result[4].items():
        out[f'critical value ({key})']=val
        
    print(out.to_string())          # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")
In [13]:
adf_test(df1['total'])
Augmented Dickey-Fuller Test: 
ADF test statistic       -5.592497
p-value                   0.000001
# lags used              18.000000
# observations          459.000000
critical value (1%)      -3.444677
critical value (5%)      -2.867857
critical value (10%)     -2.570135
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

This may take awhile as there are a lot of combinations to evaluate.

In [ ]:
# For SARIMA Orders we set seasonal=True and pass in an m value
auto_arima(df1['total'],seasonal=True,m=7).summary()

Excellent! This provides an ARIMA Order of (1,0,0) and a seasonal order of (2,0,0,7) Now let's train & test the SARIMA model, evaluate it, then compare the result to a model that uses an exogenous variable.

Split the data into train/test sets

We'll assign 42 days (6 weeks) to the test set so that it includes several holidays.

In [17]:
# Set four weeks for testing
train = df1.iloc[:436]
test = df1.iloc[436:]

Fit a SARIMA(1,0,0)(2,0,0,7) Model

NOTE: To avoid a ValueError: non-invertible starting MA parameters found we're going to set enforce_invertibility to False.

In [18]:
model = SARIMAX(train['total'],order=(1,0,0),seasonal_order=(2,0,0,7),enforce_invertibility=False)
#Why invertible process ? invertibility means that recent lags have more weigh than old lags. In other words
#the coefficient theta are < 1. Statsmodels believes this is the "normal", so if forces the coefficients to be 
#always <1. But we don't want this, so we don't want to enfornce invertibility. We want to allow that the coefficients
#of the lags are => 1
results = model.fit()
results.summary()
Out[18]:
SARIMAX Results
Dep. Variable: total No. Observations: 436
Model: SARIMAX(1, 0, 0)x(2, 0, 0, 7) Log Likelihood -2224.701
Date: Tue, 05 May 2020 AIC 4457.403
Time: 07:41:38 BIC 4473.713
Sample: 01-01-2016 HQIC 4463.840
- 03-11-2017
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.2212 0.047 4.711 0.000 0.129 0.313
ar.S.L7 0.5063 0.036 14.187 0.000 0.436 0.576
ar.S.L14 0.4574 0.037 12.379 0.000 0.385 0.530
sigma2 1520.2899 82.277 18.478 0.000 1359.029 1681.550
Ljung-Box (Q): 83.96 Jarque-Bera (JB): 29.23
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 0.86 Skew: 0.34
Prob(H) (two-sided): 0.37 Kurtosis: 4.07


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [19]:
# Obtain predicted values
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False).rename('SARIMA(1,0,0)(2,0,0,7) Predictions')

Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values).

For more information on these arguments visit https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMAResults.predict.html

In [21]:
# Plot predictions against known values
title='Restaurant Visitors'
ylabel='Visitors per day'
xlabel=''

ax = test['total'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
for x in test.query('holiday==1').index: 
    ax.axvline(x=x, color='k', alpha = 0.3);

Evaluate the Model

In [22]:
from sklearn.metrics import mean_squared_error
from statsmodels.tools.eval_measures import rmse

error1 = mean_squared_error(test['total'], predictions)
error2 = rmse(test['total'], predictions)

print(f'SARIMA(1,0,0)(2,0,0,7) MSE Error: {error1:11.10}')
print(f'SARIMA(1,0,0)(2,0,0,7) RMSE Error: {error2:11.10}')
SARIMA(1,0,0)(2,0,0,7) MSE Error: 1702.647961
SARIMA(1,0,0)(2,0,0,7) RMSE Error: 41.26315501

Now add the exog variable

In [24]:
model = SARIMAX(train['total'],exog=train['holiday'],order=(1,0,0),seasonal_order=(2,0,0,7),enforce_invertibility=False)
results = model.fit()
results.summary()
Out[24]:
SARIMAX Results
Dep. Variable: total No. Observations: 436
Model: SARIMAX(1, 0, 0)x(2, 0, 0, 7) Log Likelihood -2158.482
Date: Tue, 05 May 2020 AIC 4326.963
Time: 07:43:49 BIC 4347.352
Sample: 01-01-2016 HQIC 4335.010
- 03-11-2017
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
holiday 66.8886 4.241 15.774 0.000 58.577 75.200
ar.L1 0.2145 0.049 4.375 0.000 0.118 0.311
ar.S.L7 0.5147 0.042 12.312 0.000 0.433 0.597
ar.S.L14 0.4575 0.042 10.997 0.000 0.376 0.539
sigma2 1117.3950 73.301 15.244 0.000 973.727 1261.063
Ljung-Box (Q): 100.96 Jarque-Bera (JB): 1.24
Prob(Q): 0.00 Prob(JB): 0.54
Heteroskedasticity (H): 0.91 Skew: 0.11
Prob(H) (two-sided): 0.58 Kurtosis: 3.14


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [25]:
# Obtain predicted values
start=len(train)
end=len(train)+len(test)-1
exog_forecast = test[['holiday']]  # requires two brackets to yield a shape of (35,1)
predictions = results.predict(start=start, end=end, exog=exog_forecast).rename('SARIMAX(1,0,0)(2,0,0,7) Predictions')
In [26]:
# Plot predictions against known values
title='Restaurant Visitors'
ylabel='Visitors per day'
xlabel=''

ax = test['total'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
for x in test.query('holiday==1').index: 
    ax.axvline(x=x, color='k', alpha = 0.3);

We can see that the exogenous variable (holidays) had a positive impact on the forecast by raising predicted values at 3/17, 4/14, 4/16 and 4/17! Let's compare evaluations:

Evaluate the Model

In [27]:
# Print values from SARIMA above
print(f'SARIMA(1,0,0)(2,0,0,7) MSE Error: {error1:11.10}')
print(f'SARIMA(1,0,0)(2,0,0,7) RMSE Error: {error2:11.10}')
print()

error1x = mean_squared_error(test['total'], predictions)
error2x = rmse(test['total'], predictions)

# Print new SARIMAX values
print(f'SARIMAX(1,0,0)(2,0,0,7) MSE Error: {error1x:11.10}')
print(f'SARIMAX(1,0,0)(2,0,0,7) RMSE Error: {error2x:11.10}')
SARIMA(1,0,0)(2,0,0,7) MSE Error: 1702.647961
SARIMA(1,0,0)(2,0,0,7) RMSE Error: 41.26315501

SARIMAX(1,0,0)(2,0,0,7) MSE Error: 950.6693589
SARIMAX(1,0,0)(2,0,0,7) RMSE Error: 30.83292654

Our RMSE is lower, which means adding the exogenous variable helps us to predict better the future

Retrain the model on the full data, and forecast the future

We're going to forecast 39 days into the future, and use the additional holiday data

In [28]:
model = SARIMAX(df1['total'],exog=df1['holiday'],order=(1,0,0),seasonal_order=(2,0,0,7),enforce_invertibility=False)
results = model.fit()
exog_forecast = df[478:][['holiday']]
fcast = results.predict(len(df1),len(df1)+38,exog=exog_forecast).rename('SARIMAX(1,0,0)(2,0,0,7) Forecast')
In [29]:
# Plot the forecast alongside historical values
title='Restaurant Visitors'
ylabel='Visitors per day'
xlabel=''

ax = df1['total'].plot(legend=True,figsize=(16,6),title=title)
fcast.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
for x in df.query('holiday==1').index: 
    ax.axvline(x=x, color='k', alpha = 0.3);

VAR(p)

Vector Autoregression

In our previous SARIMAX example, the forecast variable $y_t$ was influenced by the exogenous predictor variable, but not vice versa. That is, the occurrence of a holiday affected restaurant patronage but not the other way around.

However, there are some cases where variables affect each other. Forecasting: Principles and Practice describes a case where changes in personal consumption expenditures $C_t$ were forecast based on changes in personal disposable income $I_t$.

However, in this case a bi-directional relationship may be more suitable: an increase in $I_t$ will lead to an increase in $C_t$ and vice versa.
An example of such a situation occurred in Australia during the Global Financial Crisis of 2008–2009. The Australian government issued stimulus packages that included cash payments in December 2008, just in time for Christmas spending. As a result, retailers reported strong sales and the economy was stimulated. Consequently, incomes increased.

Aside from investigating multivariate time series, vector autoregression is used for

Formulation

We've seen that an autoregression AR(p) model is described by the following:

     $y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \varepsilon_{t}$

where $c$ is a constant, $\phi_{1}$ and $\phi_{2}$ are lag coefficients up to order $p$, and $\varepsilon_{t}$ is white noise.

A $K$-dimensional VAR model of order $p$, denoted VAR(p), considers each variable $y_K$ in the system.

For example, The system of equations for a 2-dimensional VAR(1) model is:

    $y_{1,t} = c_1 + \phi_{11,1}y_{1,t-1} + \phi_{12,1}y_{2,t-1} + \varepsilon_{1,t}$
    $y_{2,t} = c_2 + \phi_{21,1}y_{1,t-1} + \phi_{22,1}y_{2,t-1} + \varepsilon_{2,t}$

where the coefficient $\phi_{ii,l}$ captures the influence of the $l$th lag of variable $y_i$ on itself,
the coefficient $\phi_{ij,l}$ captures the influence of the $l$th lag of variable $y_j$ on $y_i$,
and $\varepsilon_{1,t}$ and $\varepsilon_{2,t}$ are white noise processes that may be correlated.

Carrying this further, the system of equations for a 2-dimensional VAR(3) model is:

    $y_{1,t} = c_1 + \phi_{11,1}y_{1,t-1} + \phi_{12,1}y_{2,t-1} + \phi_{11,2}y_{1,t-2} + \phi_{12,2}y_{2,t-2} + \phi_{11,3}y_{1,t-3} + \phi_{12,3}y_{2,t-3} + \varepsilon_{1,t}$
    $y_{2,t} = c_2 + \phi_{21,1}y_{1,t-1} + \phi_{22,1}y_{2,t-1} + \phi_{21,2}y_{1,t-2} + \phi_{22,2}y_{2,t-2} + \phi_{21,3}y_{1,t-3} + \phi_{22,3}y_{2,t-3} + \varepsilon_{2,t}$

and the system of equations for a 3-dimensional VAR(2) model is:

    $y_{1,t} = c_1 + \phi_{11,1}y_{1,t-1} + \phi_{12,1}y_{2,t-1} + \phi_{13,1}y_{3,t-1} + \phi_{11,2}y_{1,t-2} + \phi_{12,2}y_{2,t-2} + \phi_{13,2}y_{3,t-2} + \varepsilon_{1,t}$
    $y_{2,t} = c_2 + \phi_{21,1}y_{1,t-1} + \phi_{22,1}y_{2,t-1} + \phi_{23,1}y_{3,t-1} + \phi_{21,2}y_{1,t-2} + \phi_{22,2}y_{2,t-2} + \phi_{23,2}y_{3,t-2} + \varepsilon_{2,t}$
    $y_{3,t} = c_3 + \phi_{31,1}y_{1,t-1} + \phi_{32,1}y_{2,t-1} + \phi_{33,1}y_{3,t-1} + \phi_{31,2}y_{1,t-2} + \phi_{32,2}y_{2,t-2} + \phi_{33,2}y_{3,t-2} + \varepsilon_{3,t}$

The general steps involved in building a VAR model are:

  • Examine the data
  • Visualize the data
  • Test for stationarity
  • If necessary, transform the data to make it stationary
  • Select the appropriate order p
  • Instantiate the model and fit it to a training set
  • If necessary, invert the earlier transformation
  • Evaluate model predictions against a known test set
  • Forecast the future

Recall that to fit a SARIMAX model we passed one field of data as our endog variable, and another for exog. With VAR, both fields will be passed in as endog.

Related Functions:

vector_ar.var_model.VAR(endog[, exog, …])      Fit VAR(p) process and do lag order selection
vector_ar.var_model.VARResults(endog, …[, …])  Estimate VAR(p) process with fixed number of lags
vector_ar.dynamic.DynamicVAR(data[, …])        Estimates time-varying vector autoregression (VAR(p)) using equation-by-equation least squares

For Further Reading:

Statsmodels Tutorial:  Vector Autoregressions
Forecasting: Principles and Practice:  Vector Autoregressions
Wikipedia:  Vector Autoregression

Perform standard imports and load dataset

For this analysis we'll also compare money to spending. We'll look at the M2 Money Stock which is a measure of U.S. personal assets, and U.S. personal spending. Both datasets are in billions of dollars, monthly, seasonally adjusted. They span the 21 years from January 1995 to December 2015 (252 records).
Sources: https://fred.stlouisfed.org/series/M2SL https://fred.stlouisfed.org/series/PCE

In [95]:
import numpy as np
import pandas as pd
%matplotlib inline

# Load specific forecasting tools
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse

# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

# Load datasets
df = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/M2SL.csv',index_col=0, parse_dates=True)
df.index.freq = 'MS'

sp = pd.read_csv('./original/TSA_COURSE_NOTEBOOKS/Data/PCE.csv',index_col=0, parse_dates=True)
sp.index.freq = 'MS'
In [96]:
df = df.join(sp)
df = df.dropna()
In [97]:
#let's set it to a shorter period of time
df[df.index.to_series().between('1994-12-12', '2015-12-01')].shape
Out[97]:
(252, 2)
In [98]:
df = df[df.index.to_series().between('1994-12-12', '2015-12-01')]
In [99]:
df.columns = ["Money", "Spending"]
In [100]:
title = 'M2 Money Stock vs. Personal Consumption Expenditures'
ylabel='Billions of dollars'
xlabel=''

ax = df['Spending'].plot(figsize=(12,5),title=title,legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel)
df['Money'].plot(legend=True);

Test for stationarity, perform any necessary transformations

In [101]:
def adf_test(series,title=''):
    """
    Pass in a time series and an optional title, returns an ADF report
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(),autolag='AIC') # .dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    out = pd.Series(result[0:4],index=labels)

    for key,val in result[4].items():
        out[f'critical value ({key})']=val
        
    print(out.to_string())          # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")
In [102]:
adf_test(df['Money'],title='Money')
Augmented Dickey-Fuller Test: Money
ADF test statistic        4.219049
p-value                   1.000000
# lags used               4.000000
# observations          247.000000
critical value (1%)      -3.457105
critical value (5%)      -2.873314
critical value (10%)     -2.573044
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary
In [103]:
adf_test(df['Spending'], title='Spending')
Augmented Dickey-Fuller Test: Spending
ADF test statistic        0.124119
p-value                   0.967670
# lags used               3.000000
# observations          248.000000
critical value (1%)      -3.456996
critical value (5%)      -2.873266
critical value (10%)     -2.573019
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary

Neither variable is stationary, so we'll take a first order difference of the entire DataFrame and re-run the augmented Dickey-Fuller tests. It's advisable to save transformed values in a new DataFrame, as we'll need the original when we later invert the transormations and evaluate the model.

In [104]:
df_transformed = df.diff()
In [105]:
df_transformed = df_transformed.dropna()
adf_test(df_transformed['Money'], title='MoneyFirstDiff')
print()
adf_test(df_transformed['Spending'], title='SpendingFirstDiff')
Augmented Dickey-Fuller Test: MoneyFirstDiff
ADF test statistic       -2.070542
p-value                   0.256523
# lags used              15.000000
# observations          235.000000
critical value (1%)      -3.458487
critical value (5%)      -2.873919
critical value (10%)     -2.573367
Weak evidence against the null hypothesis
Fail to reject the null hypothesis
Data has a unit root and is non-stationary

Augmented Dickey-Fuller Test: SpendingFirstDiff
ADF test statistic     -7.289192e+00
p-value                 1.432037e-10
# lags used             2.000000e+00
# observations          2.480000e+02
critical value (1%)    -3.456996e+00
critical value (5%)    -2.873266e+00
critical value (10%)   -2.573019e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Still they are not stationary. We will need to keep differencing until they are both stationary.

Note that we need to have the same number of rows for both series.

So in case one of that was already statinary and the other not, we still need to difference both of them

In [106]:
df_transformed = df_transformed.diff().dropna()
adf_test(df_transformed['Money'], title='MoneySecondDiff')
print()
adf_test(df_transformed['Spending'], title='SpendingSecondDiff')
Augmented Dickey-Fuller Test: MoneySecondDiff
ADF test statistic     -7.076641e+00
p-value                 4.783030e-10
# lags used             1.400000e+01
# observations          2.350000e+02
critical value (1%)    -3.458487e+00
critical value (5%)    -2.873919e+00
critical value (10%)   -2.573367e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Augmented Dickey-Fuller Test: SpendingSecondDiff
ADF test statistic     -8.757023e+00
p-value                 2.737843e-14
# lags used             8.000000e+00
# observations          2.410000e+02
critical value (1%)    -3.457779e+00
critical value (5%)    -2.873609e+00
critical value (10%)   -2.573202e+00
Strong evidence against the null hypothesis
Reject the null hypothesis
Data has no unit root and is stationary

Train/test split

It will be useful to define a number of observations variable for our test set. For this analysis, let's use 12 months.

In [107]:
nobs=12  #number of observations
train, test = df_transformed[0:-nobs], df_transformed[-nobs:]

print(train.shape)
print(test.shape)
(238, 2)
(12, 2)

VAR Model Order Selection

We'll fit a series of models using the first seven p-values, and base our final selection on the model that provides the lowest AIC and BIC scores.

In [108]:
for i in range(0,15):
    model = VAR(train)
    results = model.fit(i)
    print('Order =', i)
    print('AIC: ', results.aic)
    print('BIC: ', results.bic)
    print()
Order = 0
AIC:  14.752939979130847
BIC:  14.78211872428775

Order = 1
AIC:  14.184746955713575
BIC:  14.272545946628387

Order = 2
AIC:  13.958816257243491
BIC:  14.105588791354744

Order = 3
AIC:  13.855157304170763
BIC:  14.061260270970841

Order = 4
AIC:  13.833776093682921
BIC:  14.099570025633515

Order = 5
AIC:  13.79273881657879
BIC:  14.118587940949801

Order = 6
AIC:  13.804219880373077
BIC:  14.190492172025337

Order = 7
AIC:  13.802755047357277
BIC:  14.249822282489976

Order = 8
AIC:  13.752428927397462
BIC:  14.260666738281762

Order = 9
AIC:  13.77449000181159
BIC:  14.34427793253238

Order = 10
AIC:  13.812515063457573
BIC:  14.444236626686022

Order = 11
AIC:  13.853050184127603
BIC:  14.547092918947623

Order = 12
AIC:  13.890333058237042
BIC:  14.647088589049496

Order = 13
AIC:  13.894122524782878
BIC:  14.713986621311939

Order = 14
AIC:  13.935053123157163
BIC:  14.818425761583914

Order = 8wins

Fit the VAR(5) Model

In [109]:
results = model.fit(8)
results.summary()
Out[109]:
  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Thu, 07, May, 2020
Time:                     08:03:22
--------------------------------------------------------------------
No. of Equations:         2.00000    BIC:                    14.2607
Nobs:                     230.000    HQIC:                   13.9574
Log likelihood:          -2200.24    FPE:                    939374.
AIC:                      13.7524    Det(Omega_mle):         814517.
--------------------------------------------------------------------
Results for equation Money
==============================================================================
                 coefficient       std. error           t-stat            prob
------------------------------------------------------------------------------
const               0.907834         1.772764            0.512           0.609
L1.Money           -0.710324         0.070019          -10.145           0.000
L1.Spending        -0.135029         0.053076           -2.544           0.011
L2.Money           -0.611552         0.083807           -7.297           0.000
L2.Spending        -0.268308         0.073586           -3.646           0.000
L3.Money           -0.392205         0.091055           -4.307           0.000
L3.Spending        -0.307474         0.084536           -3.637           0.000
L4.Money           -0.472933         0.090681           -5.215           0.000
L4.Spending        -0.193552         0.088103           -2.197           0.028
L5.Money           -0.385288         0.092360           -4.172           0.000
L5.Spending        -0.222786         0.086765           -2.568           0.010
L6.Money           -0.288204         0.089853           -3.208           0.001
L6.Spending        -0.185131         0.083488           -2.217           0.027
L7.Money           -0.230822         0.081595           -2.829           0.005
L7.Spending        -0.129116         0.073699           -1.752           0.080
L8.Money           -0.141479         0.068515           -2.065           0.039
L8.Spending        -0.096151         0.052771           -1.822           0.068
==============================================================================

Results for equation Spending
==============================================================================
                 coefficient       std. error           t-stat            prob
------------------------------------------------------------------------------
const               0.308129         2.310014            0.133           0.894
L1.Money            0.174574         0.091239            1.913           0.056
L1.Spending        -0.949143         0.069162          -13.724           0.000
L2.Money            0.081780         0.109206            0.749           0.454
L2.Spending        -0.712347         0.095887           -7.429           0.000
L3.Money            0.013724         0.118650            0.116           0.908
L3.Spending        -0.488045         0.110155           -4.431           0.000
L4.Money           -0.127293         0.118162           -1.077           0.281
L4.Spending        -0.366909         0.114804           -3.196           0.001
L5.Money           -0.069540         0.120350           -0.578           0.563
L5.Spending        -0.395318         0.113060           -3.497           0.000
L6.Money            0.139825         0.117084            1.194           0.232
L6.Spending        -0.305969         0.108790           -2.812           0.005
L7.Money            0.292190         0.106323            2.748           0.006
L7.Spending        -0.253122         0.096034           -2.636           0.008
L8.Money            0.214825         0.089279            2.406           0.016
L8.Spending        -0.155879         0.068764           -2.267           0.023
==============================================================================

Correlation matrix of residuals
               Money  Spending
Money       1.000000 -0.274276
Spending   -0.274276  1.000000

Predict the next 12 values

Unlike the VARMAX model we'll use in upcoming sections, the VAR .forecast() function requires that we pass in a lag order number of previous observations as well. Unfortunately this forecast tool doesn't provide a DateTime index - we'll have to do that manually.

In [110]:
lag_order = results.k_ar
lag_order
Out[110]:
8
In [111]:
# Steps = how many periods in the future you want to predict

# y = the p lagged values right before the test starts (numpy array)
z = results.forecast(y=train.values[-lag_order:], steps=12)
z
Out[111]:
array([[-11.97378484,  35.72884867],
       [ -0.96857657, -14.46781461],
       [  0.24054265,  -4.07393834],
       [ -8.27006038,   6.8545925 ],
       [  4.8827464 ,   7.92189424],
       [ -6.95706612,  -7.54399188],
       [  3.87018532,   9.21347606],
       [ -0.41973345,  -4.01872681],
       [  0.8098162 ,  -7.75635785],
       [  2.58139382,   6.27363839],
       [  0.46848727,  -3.26176991],
       [ -0.98384059,  -1.44714327]])

remeber "z" is differenced, so we cannot directly interpret them

In [118]:
idx = pd.date_range('1/1/2015', periods=12, freq='MS')
df_forecast = pd.DataFrame(z, index=idx, columns=['Money2d','Spending2d'])
df_forecast
Out[118]:
Money2d Spending2d
2015-01-01 -11.973785 35.728849
2015-02-01 -0.968577 -14.467815
2015-03-01 0.240543 -4.073938
2015-04-01 -8.270060 6.854592
2015-05-01 4.882746 7.921894
2015-06-01 -6.957066 -7.543992
2015-07-01 3.870185 9.213476
2015-08-01 -0.419733 -4.018727
2015-09-01 0.809816 -7.756358
2015-10-01 2.581394 6.273638
2015-11-01 0.468487 -3.261770
2015-12-01 -0.983841 -1.447143

Invert the Transformation

Remember that the forecasted values represent second-order differences. To compare them to the original data we have to roll back each difference. To roll back a first-order difference we take the most recent value on the training side of the original series, and add it to a cumulative sum of forecasted values. When working with second-order differences we first must perform this operation on the most recent first-order difference.

Here we'll use the nobs variable we defined during the train/test/split step.

In [119]:
# Money_d2 = 2nd difference

# Add the most recent first difference from the training side of the original dataset to the forecast cumulative sum
df_forecast['Money1d'] = (df['Money'].iloc[-nobs-1]-df['Money'].iloc[-nobs-2]) + df_forecast['Money2d'].cumsum()

# Now build the forecast values from the first difference set
df_forecast['MoneyForecast'] = df['Money'].iloc[-nobs-1] + df_forecast['Money1d'].cumsum()

# Add the most recent first difference from the training side of the original dataset to the forecast cumulative sum
df_forecast['Spending1d'] = (df['Spending'].iloc[-nobs-1]-df['Spending'].iloc[-nobs-2]) + df_forecast['Spending2d'].cumsum()

# Now build the forecast values from the first difference set
df_forecast['SpendingForecast'] = df['Spending'].iloc[-nobs-1] + df_forecast['Spending1d'].cumsum()
In [120]:
df_forecast
Out[120]:
Money2d Spending2d Money1d MoneyForecast Spending1d SpendingForecast
2015-01-01 -11.973785 35.728849 65.426215 11733.426215 50.528849 12115.528849
2015-02-01 -0.968577 -14.467815 64.457639 11797.883854 36.061034 12151.589883
2015-03-01 0.240543 -4.073938 64.698181 11862.582035 31.987096 12183.576978
2015-04-01 -8.270060 6.854592 56.428121 11919.010156 38.841688 12222.418667
2015-05-01 4.882746 7.921894 61.310867 11980.321023 46.763582 12269.182249
2015-06-01 -6.957066 -7.543992 54.353801 12034.674824 39.219591 12308.401840
2015-07-01 3.870185 9.213476 58.223986 12092.898811 48.433067 12356.834906
2015-08-01 -0.419733 -4.018727 57.804253 12150.703064 44.414340 12401.249246
2015-09-01 0.809816 -7.756358 58.614069 12209.317133 36.657982 12437.907228
2015-10-01 2.581394 6.273638 61.195463 12270.512596 42.931620 12480.838848
2015-11-01 0.468487 -3.261770 61.663950 12332.176546 39.669850 12520.508699
2015-12-01 -0.983841 -1.447143 60.680110 12392.856656 38.222707 12558.731406

Plot the results

The VARResults object offers a couple of quick plotting tools:

In [121]:
results.plot();
In [122]:
results.plot_forecast(12);

But for our investigation we want to plot predicted values against our test set.

In [123]:
df['Money'][-nobs:].plot(figsize=(12,5),legend=True).autoscale(axis='x',tight=True)
df_forecast['MoneyForecast'].plot(legend=True);
In [126]:
df['Spending'][-nobs:].plot(figsize=(12,5),legend=True).autoscale(axis='x',tight=True)
df_forecast['SpendingForecast'].plot(legend=True);

Evaluate the model

    $RMSE = \sqrt{{\frac 1 L} \sum\limits_{l=1}^L (y_{T+l} - \hat y_{T+l})^2}$

where $T$ is the last observation period and $l$ is the lag.

In [127]:
RMSE1 = rmse(df['Money'][-nobs:], df_forecast['MoneyForecast'])
print(f'Money VAR(5) RMSE: {RMSE1:.3f}')
Money VAR(5) RMSE: 48.598
In [135]:
df["Money"].mean()
Out[135]:
6978.396428571428
In [128]:
RMSE2 = rmse(df['Spending'][-nobs:], df_forecast['SpendingForecast'])
print(f'Spending VAR(5) RMSE: {RMSE2:.3f}')
Spending VAR(5) RMSE: 61.154
In [136]:
df["Spending"].mean()
Out[136]:
8561.7626984127

The RMSEs are not that bad

Let's compare these results to individual AR(8) models

In [137]:
from statsmodels.tsa.ar_model import AR,ARResults

Money

In [138]:
modelM = AR(train['Money'])
AR5fit1 = modelM.fit(maxlag=8,method='mle')
print(f'Lag: {AR5fit1.k_ar}')
print(f'Coefficients:\n{AR5fit1.params}')
Lag: 8
Coefficients:
const       0.718099
L1.Money   -0.629302
L2.Money   -0.517862
L3.Money   -0.302883
L4.Money   -0.415547
L5.Money   -0.248550
L6.Money   -0.190995
L7.Money   -0.171112
L8.Money   -0.140425
dtype: float64
In [139]:
start=len(train)
end=len(train)+len(test)-1
z1 = pd.DataFrame(AR5fit1.predict(start=start, end=end, dynamic=False),columns=['Money'])
In [140]:
z1
Out[140]:
Money
2015-01-01 -15.295217
2015-02-01 -8.164076
2015-03-01 7.622441
2015-04-01 -7.228420
2015-05-01 -0.496264
2015-06-01 2.309015
2015-07-01 -0.205483
2015-08-01 -0.276575
2015-09-01 4.390986
2015-10-01 -1.453108
2015-11-01 -0.784858
2015-12-01 1.459523

Invert the Transformation, Evaluate the Forecast

In [141]:
# Add the most recent first difference from the training set to the forecast cumulative sum
z1['Money1d'] = (df['Money'].iloc[-nobs-1]-df['Money'].iloc[-nobs-2]) + z1['Money'].cumsum()

# Now build the forecast values from the first difference set
z1['MoneyForecast'] = df['Money'].iloc[-nobs-1] + z1['Money1d'].cumsum()
In [150]:
RMSE3 = rmse(df['Money'][-nobs:], z1['MoneyForecast'])

print(f'Money VAR(8) RMSE: {RMSE1:.3f}')
print(f'Money  AR(8) RMSE: {RMSE3:.3f}')
Money VAR(8) RMSE: 48.598
Money  AR(8) RMSE: 31.488

Personal Spending

In [145]:
modelS = AR(train['Spending'])
AR8fit2 = modelS.fit(maxlag=8,method='mle')
print(f'Lag: {AR5fit2.k_ar}')
print(f'Coefficients:\n{AR5fit2.params}')
Lag: 5
Coefficients:
const          0.257647
L1.Spending   -0.917038
L2.Spending   -0.684009
L3.Spending   -0.457346
L4.Spending   -0.276680
L5.Spending   -0.162721
dtype: float64
In [146]:
z2 = pd.DataFrame(AR8fit2.predict(start=start, end=end, dynamic=False),columns=['Spending'])
z2
Out[146]:
Spending
2015-01-01 31.367550
2015-02-01 -3.140180
2015-03-01 -3.357115
2015-04-01 4.397118
2015-05-01 1.782896
2015-06-01 -10.029480
2015-07-01 11.297009
2015-08-01 -8.019501
2015-09-01 -1.048731
2015-10-01 4.842837
2015-11-01 -0.869743
2015-12-01 -1.095601

Invert the Transformation, Evaluate the Forecast

In [147]:
# Add the most recent first difference from the training set to the forecast cumulative sum
z2['Spending1d'] = (df['Spending'].iloc[-nobs-1]-df['Spending'].iloc[-nobs-2]) + z2['Spending'].cumsum()

# Now build the forecast values from the first difference set
z2['SpendingForecast'] = df['Spending'].iloc[-nobs-1] + z2['Spending1d'].cumsum()
In [149]:
RMSE4 = rmse(df['Spending'][-nobs:], z2['SpendingForecast'])

print(f'Spending VAR(8) RMSE: {RMSE2:.3f}')
print(f'Spending  AR(8) RMSE: {RMSE4:.3f}')
Spending VAR(8) RMSE: 61.154
Spending  AR(8) RMSE: 66.966

Looks like for the Spending variable, the VAR(8) performs better but not in the Money variable

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: